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Combinatorial Clustering and the Beta 
Negative Binomial Process 

Tamara Broderick, Lester Mackey, John Paisley, Michael I. Jordan 

Abstract 

We develop a Bayesian nonparametric approach to a general family of latent class problems in which individuals 
can belong simultaneously to multiple classes and where each class can be exhibited multiple times by an individual. 
We introduce a combinatorial stochastic process known as the negative binomial process (NBP) as an infinite- 
dimensional prior appropriate for such problems. We show that the NBP is conjugate to the beta process, and we 
characterize the posterior distribution under the beta negative binomial process (BNBP) and hierarchical models 
based on the BNBP (the HBNBP). We study the asymptotic properties of the BNBP and develop a three-parameter 
extension of the BNBP that exhibits power-law behavior. We derive MCMC algorithms for posterior inference under 
the HBNBP, and we present experiments using these algorithms in the domains of image segmentation, object 
recognition, and document analysis. 



1 Introduction 



In traditional clustering problems the goal is to induce a set of latent classes and to assign each data point 
to one and only one class. This problem has been approached within a model-based framework via the use 
K*" of finite mixture models, where the mixture components characterize the distributions associated with the 
classes, and the mixing proportions capture the mutual exclusivity of the classes (Fraley and Raftery, 2002; 
. . . McLachlan and Basford, 1988). In many domains in which the notion of latent classes is natural, however, it 
is unrealistic to assign each individual to a single class. For example, in genetics, while it may be reasonable 
to assume the existence of underlying ancestral populations that define distributions on observed alleles, 
each individual in an existing population is likely to be a blend of the patterns associated with the ancestral 
populations. Such a genetic blend is known as an admixture (Pritchard et al., 2000). A significant literature 
on model-based approaches to admixture has arisen in recent years (Blei et al., 2003; Erosheva and Fienberg, 
2005; Pritchard et al., 2000), with applications to a wide variety of domains in genetics and beyond, including 
document modeling and image analysis.^ 
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1. While we refer to such models generically as "admixture models," we note that they are also often referred to as topic models or 
mixed membership models. 
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Model-based approaches to admixture are generally biiilt on the foundation of mixture modeling. The basic 
idea is to treat each individual as a collection of data, with an exchangeability assumption imposed for the data 
within an individual but not between individuals. For example, in the genetics domain the intra-individual 
data might be a set of genetic markers, with marker probabilities varying across ancestral populations. In 
the document domain the intra-individual data might be the set of words in a given document, with each 
document (the individual) obtained as a blend across a set of imderlying "topics" that encode probabilities for 
the words. In the image domain, the intra-individual data might be visual characteristics like edges, hue, and 
location extracted from image patches. Each image is then a blend of object classes (e.g., grass, sky, or car), each 
defining a distinct distribution over visual characteristics. In general, this blending is achieved by making use 
of the probabilistic structure of a finite mixture but using a different sampling pattern. In particular, mixing 
proportions are treated as random effects that are drawn once per individual, and the data associated with that 
individual are obtained by repeated draws from a mixture model having that fixed set of mixing proportions. 
The overall model is a hierarchical model, in which mixture components are shared among individuals and 
mixing proportions are treated as random effects. 

Although the literature has focused on using finite mixture models in this context, there has also been 
a growing literature on Bayesian nonparametric approaches to admixture models, notably the hierarchical 
Dirichlet process (HDP) (Teh et al., 2006), where the niimber of shared mixture components is infinite. Our 
focus in the current paper is also on nonparametric methods, given the open-ended nature of the inferential 
objects with which real-world admixture modeling is generally concerned. 

Although viewing an admixture as a set of repeated draws from a mixture model is natural in many 
situations, it is also natural to take a different perspective, akin to latent trait modeling, in which the individual 
(e.g., a document or a genotype) is characterized by the set of "traits" or "features" that it possesses, and 
where there is no assumption of mutual exclusivity. Here the focus is on the individual and not on the "data" 
associated with an individual. Indeed, under the exchangeability assumption alluded to above it is natural 
to reduce the repeated draws from a mixture model to the counts of the numbers of time that each mixture 
component is selected, and we may wish to model these counts directly. We may further wish to consider 
hierarchical models in which there is a linkage among the coimts for different individuals. 

This idea has been made explicit in a recent line of work based on the beta process. Originally developed 
for survival analysis, where an integrated form of the beta process was used as a model for random hazard 
functions (Hjort, 1990), more recently it has been observed that the beta process also provides a natural 
framework for latent feature modeling (Thibaux and Jordan, 2007). In particular, as we discuss in detail in 
Section 2, a draw from the beta process yields an infinite collection of coin-tossing probabilities. Tossing 
these coins — a draw from a Bernoulli process — one obtains a set of binary features that can be viewed as a 
description of an admixed individual. A key advantage of this approach is the conjugacy between the beta 
and Bemoiilli processes: this property allows for tractable inference, despite the countable infinitude of coin- 
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tossing probabilities. A limitation of this approach, however, is its restriction to binary features; indeed, one 
of the virtues of the mixture-model-based approach is that a given mixture component can be selected more 
than once, with the total number of selections being random. 

To develop a more generally useful tool for modeling admixture within a feature-based approach, we note 
that in the setting of classical random variables, beta-Bernoulli conjugacy is not the only form of conjugacy 
involving the beta distribution — the negative binomial distribution is also conjugate to the beta. Anticipating 
the value of conjugacy in the setting of nonparametric models, we are motivated to develop a stochastic process 
analogue of the negative binomial distribution, a stochastic process that is conjugate to the beta process. It is 
one of the contributions of the current paper to define this process, which we refer to as the negative binomial 
process (NBP)^, to provide a rigorous proof of its conjugacy to the beta process, and to explore its properties. 
We develop a model-based approach to admixture modeling based on the NBP, focusing in particiilar on the 
role of the NBP in Bayesian hierarchical models. 

The beta process and the NBP are not the only way to generate infinite vectors of coimts, and indeed 
there has been previous work on nonparametric count models based on the gamma process and the Poisson 
likelihood process (Thibaux, 2008; Titsias, 2008). A second contribution of the current paper is to explore the 
cormections between these stochastic processes and the beta process and NBP. Indeed, although some of the 
connections among the stochastic processes used in Bayesian nonparametrics are well known (e.g., that the 
Dirichlet process can be obtained from the gamma process by normalization), in general there is a far less 
clear view of the linkages between these processes than there is of the linkages between the corresponding 
classical random variables. We are able to establish several novel connections, including a new connection 
between the beta process and the gamma process that makes use of yet another stochastic process that we 
refer to as the beta prime process. 

The remainder of the paper is organized as follows. In Section 2 we present the framework of completely 
random measures that provides the formal underpinnings for our work. We discuss the Bernoulli process, the 
NBP, and their conjugacy to the beta process in Section 3. Section 4 focuses on the problem of modeling ad- 
mixture and on general hierarchical modeling based on the negative binomial process. We explore connections 
between the NBP and other stochastic processes in Section 5. Section 6 and Section 7 are devoted to a study of 
the asymptotic behavior of the NBP with a beta process prior, which we call the beta negative binomial process 
(BNBP). We describe algorithms for posterior inference in Section 8. Finally, we present experimental results. 
First, we demonstrate the utility of the BNBP in the domain of automatic image segmentation in Section 9. 
Second, we use the BNBP to define a generative model for summaries of terrorist incidents with the goal of 
identifying the perpetrator of a given terrorist attack in Section 10. Section 11 presents our conclusions. 

2. We note that Zhou et al. (2012) have independently studied the negative binomial process in recent work, focusing on applications 
to matrix factorization problems. 
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2 Completely Random Measures 

In this section we review the notion of a completely random measure (CRM), a general constiuction that 
yields random measures that are closely tied to classical constructions involving sets of independent random 
variables. We present CRM-based constructions of several of the stochastic processes used in Bayesian non- 
parametrics, including the beta process, gamma process, and Dirichlet process. In the following section we 
biiild on the foundations presented here to consider additional stochastic processes. 

Consider a probability space J", P). A random measure is a random element /x such that ijl{A) is a non- 
negative random variable for any A in the sigma algebra F. A completely random measure (CRM) is a random 
measiire such that, for any disjoint, measiirable sets A, A' e F, we have that /u(A) and /u(^') are independent 
random variables (Kingman, 1967). Completely random measures can be shown to be composed of at most 
three components: 

1) A deterministic measure. For deterministic /Xdet/ it is trivially the case that Hdet{A) and UdetiA') are inde- 
pendent for disjoint A, A'. 

2) A set of fixed atoms. Let (ui, . . . , ml) G be a collection of deterministic locations, and let (tji, . . . , tjl) G 

be a collection of independent random weights for the atoms. The collection may be coimtably 
infinite, in which case we say L = oo. Then let /U/i^ = J2f=i Vi^ui- The independence of the r]i implies 
the complete randomness of the measure. 

3) An ordinary component. Let uhe a Poisson process intensity on the space ^ x M+. Let {(vi, ^i), (i'2, ^2), • • •} 
be a draw from this Poisson process. Then the ordinary component is the measure iiord = ^j^vj- 
Here, the complete randomness follows from properties of the Poisson process. 

One observation from this componentwise breakdown of CRMs is that we can obtain a countably infinite 
set of random variables, the ^j, from the Poisson process component if is sigma-finite. In this light, we might 
see the disjoint independence criterion of the CRM as an extension of an independence assumption in the 
case of a finite set of random variables. We cover specific examples next. 

2.1 Beta process 

We have seen that CRMs have three components. Therefore, in order to describe any CRM, it is enough to 
specify the deterministic measure, fixed atoms, and ordinary component. The beta process (Hjort, 1990; Kim, 
1999; Thibaux and Jordan, 2007) is an example of a CRM. It has the following parameters: a mass parameter 
7 > 0, a concentration parameter > Q, purely atomic measure Hfi^ = piS^ with ^pi G (0, 1) for all / a.s., 
and a purely continuous measure Hard on ^. Often the final two measure parameters are abbreviated as their 
sum: H = Hfi^ + Hard- 

Given these parameters, the beta process has the following description as a CRM: 

1) The deterministic measure is imiformly zero. 
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2) The fixed atoms have locations {ui, . . . ,ul) e where L is potentially infinite though typically finite. 
Atom weight rji has distribution 

m'^^Beta{ejpi,e{l-'ypi)), (1) 

where the pi parameters are the weights in the purely atomic measure Hfix- 

3) The ordinary component has Poisson process intensity Hgrd x i^, where u is the measure 

iy{db) = -r0b-\i - bf-\ (2) 

which is sigma-fiivite with fiivite mean. It follows that the number of atoms in this component will be 
countably infirute with finite sum. 
As Thibaux and Jordan (2007) point out, Eq. 2 can be generalized by allowing to depend on the ^ coordinate. 

The homogeneous intensity in Eq. 2 seems to be used predominantly in practice (Thibaux and Jordan, 2007; 
Fox et al., 2009) though, and we focus on it here for ease of exposition. Nonetheless, we note that our results 
below extend easily to the non-homogeneous case. 

The CRM is the sum of its components. Therefore, we may write a draw from the beta process as 

oo L oo 

fe=l 1=1 j=l 

with atom locations equal to the union of the fixed atom and ordinary component atom locations {tpk}k = 
U {vj}'^^. Notably, B is a.s. discrete. We denote a draw from the beta process as B ~ BP{0,^,H). 
The provenance of the name "beta process" is now clear; each atom weight in the fixed atomic component is 
beta-distributed, and the Poisson process intensity generating the ordinary component is that of an improper 
beta distribution. 

From the above description, the beta process provides a prior on a potentially infinite vector of weights, 
each in (0,1) and each associated with a corresponding parameter t/j E The potential countable infinity 
comes from the Poisson process component. The weights in (0, 1) may be interpreted as probabilities, though 
not as a distribution across the indices as we note that they need not sum to one. We will see in Section 4 that 
the beta process is appropriate for featural modeling (Thibaux and Jordan, 2007; Griffiths and Ghahramani, 
2006). In this context, each atom, indexed by k, of B corresponds to a feature. The atom weights {bk}, which 
are each in [0, 1] a.s., can be viewed as representing the frequency with which each feature occurs in the data 
set. The atom locations {tpk} represent parameters associated with the features that can be used in forming a 
likelihood. 

In Section 6, we will show that an extension to the beta process called the three-parameter beta process has 
certain desirable properties beyond the classic beta process, in particular its ability to generate power-law 
behavior (Teh and GoriAr, 2009; Broderick et al., 2012), which roughly says that the number of features grows 
as a power of the number of data points. In the three-parameter case, we introduce a discount parameter 
a G (0, 1) with 9 > -a and 7 > such that: 



1) There is again no deterininistic component. 

2) The fixed atoms have locations (ui, . . . , ul) e with L potentially infinite but typically finite. Atom 
weight rji has distribution rji '~ Beta {O'jpi — a, 6{1 — ^pi) + a), where the pi parameters are the weights 
in the purely atomic measure Hfi^ and we now have the constraints 9^pi - a, 6{l - ^pi) + a > 0. 

3) The ordinary component has Poisson process intensity -fford x v, where v is the measure: 

Again, we focus on the homogeneous intensity as in the beta process case though it is straightforward to 
allow 6 to depend on coordinates in 

In this case, we again have the full process draw i? as in Eq. 3, and we say B ~ 3BP(a, 6, 7, H). 

2.2 Reparameterized beta process 

The specification that the atom parameters in the beta process be of the form 6'ypi and ^(1 — ^pi) can be 
urmecessarily constraining. Indeed, the classical beta distribution has two free parameters. Yet, in the beta 
process as described above, 6 and 7 are determined as part of the Poisson process intensity, so there is 
essentially one free parameter for each of the beta-distributed weights associated with the atoms (Eq. 1). A 
related problematic issue is that the beta process forces the two parameters in the beta distribution associated 
with each atom to sum to 9, which is constant across all of the atoms. 

One way to remove these restrictions is to allow 9 = 9{ip), a fimction of the position ^ G (Thibaux and 
Jordan, 2007). However, we will see in Section 5 that there are reasons to prefer a fixed concentration parameter 
9 for the ordinary component; there is a fundamental relation between this parameter and similar parameters 
in other common CRMs (e.g., the Dirichlet process, which we describe in Section 2.4). Moreover, the concern 
here is entirely centered on the behavior of the fixed atoms of the process, and letting 9 depend on ip retains 
the unusual — from a classical parametric perspective — form of the beta distribution in Eq. 1. As an alternative, 
we provide a specification of the beta process that more closely aligns with the classical perspective in which 
we allow two general beta parameters for each atom. As we will see, this reparameterization is natural, and 
indeed necessary, in considering conjugacy. 

We thus define the reparameterized beta process (REP) as having the following parameterization: a mass 
parameter 7 > 0, a concentration parameter 6' > 0, a number of fixed atoms L e {0, 1, 2, . . .} U {00} with locations 
{ui,. . . ,ul) € two sets of strictly positive atom weight parameters and {cr;};^i, and a purely 

continuous measure Hard on In this case, the atom weight parameters satisfy the simple condition pi,ai > 
for all Z G {1, . . . , L}. This specification is the same as the beta process specification introduced above with the 
sole exception of a more general parameterization for the fixed atoms. We obtain the following CRM: 

1) There is no determirvistic measure. 



2) There are L fixed atoms with locations {ui, . . . ,ul) e and corresponding weights rji '~ Beta {puai) . 
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3) The ordinary componerit has Poisson process mtensity Hgrd x z^/ where f is the measure ^{db) = 7^6~^(1 — 

As discussed above, we favor the homogeneous intensity u in exposition but note the straightforward extension 

to allow 9 to depend on location. 

We denote this CRM by S ~ RBP(6', 7, u, p, cr, Hord)- 

2.3 Gamma process 

While the beta process provides a countably infinite vector of frequencies in (0, 1] with associated parameters 
V'fe, it is sometimes useful to have a countably infinite vector of positive, real-valued quantities that can be 
used as rates rather than frequencies for features. We can obtain such a prior with the gamma process (Ferguson, 
1973), a CRM with the following parameters: a concentration parameter ^ > 0, a scale parameter c > 0, a purely 
atomic measure Hfix = J2i Pi^ui with V/, pi > 0, and a purely continuous measiure Hord with support on ^. 
Its description as a CRM is as follows (Thibaiix, 2008): 

1) There is no deterministic measure. 

2) The fixed atoms have locations . . . , 71^) e 'i!^ , where L is potentially infinite but typically finite. Atom 
weight rii has distribution rji '~ Gamma(6'p;, c), where we use the shape-inverse-scale parameterization 
of the gamma distribution and where the pi parameters are the weights in the purely atomic measure 
Hfix- 

3) The ordinary component has Poisson process intensity Hord x i', where v is the measure: 

v{dg) = dg~^ exp {-eg) dg. (4) 

As in the case of the beta process, the gamma process can be expressed as the sum of its components: 

G = Efc ~9k5^H = Ef=i mK + J2j ij^v, ■ We denote this CRM as G ~ TV{e, c, H), for H = Hfix + Hord- 

2.4 Dirichlet process 

While the beta process has been used as a prior in featural models, the Dirichlet process is the classic Bayesian 
nonparametric prior for clustering models (Ferguson, 1973; MacEachern and Miiller, 1998; McCloskey, 1965; 
Neal, 2000; West, 1992). The Dirichlet process itself is not a CRM; its atom weights, which represent cluster 
frequencies, must sum to one and are therefore correlated. But it can be obtained by normalizing the gamma 

process (Ferguson, 1973). 

In particular, using facts about the Poisson process (Kingman, 1993), one can check that, when there are 
finitely many fixed atoms, G{^) < 00 a.s.; that is, the total mass of the gamma process is almost surely finite 
despite having infinitely many atoms from the ordinary component. Therefore, normalizing the process by 
dividing its weights by its total mass is well-defined. We thus can define a Dirichlet process as 

k 
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where G ~ rP(^, 1, H), and where there are two parameters: a concentration parameter 9 and a base measure H 
with finitely many fixed atoms. Note that while we have chosen the scale parameter c = 1 in this construction, 
the choice is in fact arbitrary for c > and does not affect the G distribution (Pitman, 2006). 

From this construction, we see immediately that the Dirichlet process is almost surely atomic, a property 
inherited from the gamma process. Moreover, not only are the weights of the Dirichlet process all contained 
in (0, 1) but they further sum to one. Thus, the Dirichlet process may be seen as providing a probability 
distribution on a countable set. In particular, this countable set is often viewed as a countable number of 
clusters, with cluster parameters Vfe- 

3 CONJUGACY AND COMBINATORIAL CLUSTERING 

In Section 2, we introduced CRMs and showed how a number of classical Bayesian nonparametric priors 
can be derived from CRMs. These priors provide infinite-dimensional vectors of real values, which can be 
interpreted as feature frequencies, feature rates, or cluster frequencies. To flesh out such interpretations we 
need to couple these real-valued processes with discrete-valued processes that capture combinatorial structure. 
In particular, viewing the weights of the beta process as feature frequencies, it is natural to consider binomial 
and negative binomial models that transform these frequencies into binary values or nonnegative integer 
counts. In this section we describe stochastic processes that achieve such transformations, again relying on 
the CRM framework. 

The use of a BemoiiUi likelihood whose frequency parameter is obtained from the weights of the beta 
process has been explored in the context of survival models by Hjort (1990) and Kim (1999) and in the context 
of feature modeling by Thibaux and Jordan (2007). After reviewing that construction, we discuss a similar 
construction based on the negative binomial process. Moreover, recalling that Hjort (1990) and Kim (1999) have 
shown that the binomial likelihood is conjugate to the beta process, we demonstrate an analogous conjugacy 
result for the negative binomial process. 

3.1 Bernoulli process 

One way to make use of the beta process is to couple the process to a Bernoulli process (Hjort, 1990; Kim, 1999; 
Thibaux and Jordan, 2007). The Bemoiilli process, denoted BeP{H), has a single parameter, a base measure H; 
H is any discrete measure with atom weights in (0, 1]. Although our focus will be on models in which ^ is a 
draw from a beta process, as a matter of the general defirution of the Bernoulli process the base measure H 
need not be a CRM or even random — ^just as the Poisson distribution is defined relative to a parameter that 
may or may not be random in general but which is sometimes given a gamma distribution prior. Since H is 
discrete by assumption, we may write 

oo 

H = Y,bkhu (5) 

/c=l 
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with bk G (0, 1]. We say that the random measure / is drawn from a BernouUi process, / ~ BeP{H), if 

I = SfeLi h^^k with ik '~ Bern(6fe) for k = 1,2, That is, to form the Bernoulli process, we simply make a 

Bernoulli random variable draw for every one of the (potentially coimtable) atoms of the base measure. 

One interpretation for this construction is that the atoms of the base measure H represent features of an 
individual, with feature frequencies equal to the atom weights and feature characteristics defined by the atom 
locations. The Bernoulli process draw can be viewed as characterizing the individual by the set of features 
whose atom weight is equal to one. Suppose H is derived from a Poisson process as the ordinary component 
of a completely random measure and has finite mass; then the number of features exhibited by the Bernoulli 
process, i.e. the total mass of the Bernoulli process draw, is a.s. finite. Thus the Bernoulli process can be viewed 
as providing a Bayesian nonparametric model of sparse binary feature vectors. 

Now suppose that the base measure parameter is a beta process draw with parameters ^ > 0, 7 > 0, and 
base measure H. That is, B BP(^, 7, H) and I ~ BeP(B). We refer to the overall process as the beta Bernoulli 
process (BBeP). Suppose that the beta process B has a finite number of fixed atoms. Then we note that the 
finite mass of the ordinary component of B implies that / has support on a finite set. That is, even though B 
has a countable infinity of atoms, / has only a finite number of atoms. This observation is important since, in 
any practical model, we will want an individual to exhibit only finitely many features. 

Hjort (1990) and Kim (1999) have shown that the posterior distribution of B under the BBeP is also a 
beta process, with known parameters. Hjort (1990) and Kim (1999) originally established this conjugacy in the 
domain of survival analysis, and Thibaux and Jordan (2007) extended their analysis to feature models. We 
sununarize the result by Thibaux and Jordan (2007) here. 

Theorem 1. The beta process prior is conjugate to the Bernoulli process likelihood. 

Theorem 17 in Appendix D.l describes the conjugacy in more detail; in particular, we enumerate the exact 
posterior parameter values there. 

As shown by Thibaux and Jordan (2007), if the underlying beta process is integrated out in the BBeP, we 
recover the Indian buffet process of Griffiths and Ghahramarvi (2006). 

An easy consequence of Theorem 1 is the following. 

Corollary 2. The RBP prior is conjugate to the Bernoulli process likelihood. 

The parameterization of the posterior is described in detail in Corollary 18 in Appendix D.2 The usefulness 
of the RBP becomes apparent in the posterior parameterization; the distributions associated with the fixed 

atoms more closely mirror the classical parametric conjugacy between the Bernoulli distribution and the beta 
distribution. This is an issue of convenience in the case of the BBeP, but it is more significant in the case of 
the negative binomial process, as we show in the following section, where conjugacy is preserved only in the 
RBP case. 
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3.2 Negative binomial process 

The Bernoulli distribution is not the only distribution that yields conjugacy when coupled to the beta distri- 
bution in the classical parametric setting; conjugacy also holds for the negative binomial distribution. As we 
show in this section, this result can be extended to stochastic processes via the CRM framework. 

We define the negative binomial process as a CRM with two parameters: a shape parameter r > and a 
discrete base measure H = J2k ^fc^iAfe whose atom weights bk take values in (0, 1]. As in the case of the Bernoulli 
process, H need not be random at this point. Since H is discrete, we again have a representation for H as 
in Eq. 5, and we say that the random measure / is drawn from a negative binomial process, I ~ NBP(r, H), 

it I = J2kLi ik^i'k with ik NB(r, 6/j) for k = 1,2, That is, the negative binomial process is formed 

by simply making a single draw from a negative binomial distribution at each of the (potentially coimtably 
infinite) atoms of H. This construction generalizes the geometric process studied by Thibaux (2008). 

As a Bernoulli process draw can be interpreted as assigning a set of features to a data point, so can we 
interpret a negative binomial process draw as assigning a vector of feature counts to a data point. In particular, 
as for the Bernoulli process, we assume that each data point has its own negative binomial process draw. Every 
atom with strictly positive mass in the negative binomial process draw corresponds to a feature that is exhibited 
by this data point. Moreover, the size of the atom, which is a positive integer by construction, dictates how 
many times the feature is exhibited by the data point. For example, if the data point is a document, and 
each feature represents a particular word, then the negative binomial process draw would tell us how many 
occurrences of each word there are in the document. 

If the base measure for a negative binomial process is a beta process, we say that the combined process is 
a beta negative binomial process (BNBP). If the base measure is a three-parameter beta process, we say that the 
combined process is a three-parameter beta negative binomial process (3BNBP). When either the BP or 3BP has 
a finite number of fixed atoms, we have that the ordinary component of the BP or 3BP still has an infinite 
number of atoms, but the number of atoms in the negative binomial process is a.s. finite. We prove this fact 
and more in Section 6. 

We now suppose that the base measure for the negative binomial process is a draw B from an RBP with pa- 
rameters ^ > 0, 7 > 0, {ci}^!, and Hard- The overall specification is B ~ RBP(0, 7, u, p, tr, Hord) 
and I NBP(r, B). The following theorem characterizes the posterior distribution for this model. 

Theorem 3. The (reparameterized) beta process prior is conjugate to the negative binomial process likelihood. 

The parameterization of the posterior is given by the full theorem statement in Theorem 19 in Appendix D.3. 

4 Mixtures and admixtures 

We now assemble the pieces that we have introduced and consider Bayesian nonparametiic models of ad- 
mixture. Recall that the basic idea of an admixture is that an individual (e.g., an organism, a document or an 
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image) can belong simultaneously to multiple classes. This can be represented by associating a binary-valued 
vector with each individual; the vector has value one in components corresponding to classes to which the 
individual belongs and zero in components corresponding to classes to which the individual does not belong. 
More generally, we wish to remove the restriction to binary values and consider a general notion of admixture 

in which an individual is represented by a nonnegative, integer-valued vector. We refer to such vectors as 
feature vectors, and view the components of such vectors as coimts representing the number of times the 
corresponding feature is exhibited by a given individual. For example, a docimient may exhibit a given word 
zero or more times. 

As we discussed in Section 1, the standard approach to modeling an admixture is to assume that there is 
an exchangeable set of data associated with each individual and to assume that these data are drawn from a 
finite mixture model with individual-specific mixing proportions. There is another way to view this process, 
however, that opens the door to a variety of extensions. Note that to draw a set of data from a mixture, we 
can first choose the number of data points to be associated with each mixture component (a vector of counts) 
and then draw the appropriate number of data points independently from each selected mixture component. 
That is, we randomly draw nonnegative integers ik for each mixture component (or cluster) k. Then, for each 
k and each n = l,...,ik, we draw a data point .Tfc„ ~ F{ijjk), where ipk is the parameter associated with 
mixture component k. The overall data set is {xk,n}k,n^ with A'^ = Z^fc^fc total points. One way to generate 
data according to this decomposition is to make use of the NBP. We draw / = J2k '^kSi/j^. ~ NBP(r, B), where 
B is drawn from a beta process, B ~ BP(^, 7, H). The overall model is a BNBP mixture model for the counts, 
coupled to a conditionally independent set of draws for the data points {xk,n}k,n- 

An alternative approach in the same spirit is to make use of a gamma process (to obtain a set of rates) 
that is coupled to a Poisson likelihood process (PLP)^ to convert the rates into counts (Titsias, 2008). In 
particular, given a base measure G = J2k9k^i'k' / ~ PLP(G') denote I = J^k^kSt/j,^, with ik ^ Pois(,gfc). 
We then consider a gamma Poisson process (FPLP) as follows: G ~ rP{9,c,H), I = '^k'^'^^^k ~ ^^^{G), and 
Xk,n ~ Pilpk), for n = 1, . . . , ife and each k. 

Both the BNBP approach and the FPLP approach deliver a random measure, / = J^k'^'k^i'k' ^ repre- 
sentation of an admixed individual. While the atom locations, (tpk), are subsequently used to generate data 
points, the pattern of admixture inheres in the vector of weights {ik). It is thus natural to view this vector as 
the representation of an admixed individual. Indeed, in some problems such a weight vector might itself be 
the observed data. In other problems, the weights may be used to generate data in some more complex way 
that does not simply involve conditionally i.i.d. draws. 

This perspective on admixture — focusing on the vector of weights (ik) rather than the data associated with 

an individual — is also natural when we consider multiple individuals. The main issue becomes that of linking 

3. We use the terminology "Poisson likelihood process" to distinguish a particular process with Poisson distributions affixed to each 
atom of some base distribution from the more general Poisson point process of Kingman (1993). 
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these vectors among multiple individuals, and this can readily be achieved in the Bayesian formalism via a 
hierarchical model. In the remainder of this section we consider examples of such hierarchies in the Bayesian 
nonparametric setting. 

Let us first consider the standard approach to admixture in which an individual is represented by a set of 

draws from a mixture model. For each individual we need to draw a set of mixing proportions, and these 
mixing proportions need to be coupled among the individuals. This can be achieved via a prior known as the 
hierarchical Dirichlet process (HDP) (Teh et al., 2006): 

Go ~ DP (6»,iJ) 

Gd = Y.9d,k5^. DP((^d, Go), d = 1, 2, . . . , 

k 

where the index d ranges over the individuals. Note that the global measure Go is a discrete random probability 
measure, given that it is drawn from a Dirichlet process. In drawing the individual-specific random measure 
Gd at the second level, we therefore resample from among the atoms of Go and do so according to the weights 
of these atoms in Gq. This shares atoms among the individuals and couples the individual-specific mixing 
proportions gd,k- We complete the model specification as follows: 

Zd,n'^ {gd,k)k fOT n=l,...,Nd 

Xd,n '~ -F(V'^rf,„), 

which draws an index z^.n from the discrete distribution {gd,k)k and then draws a data point Xd,n from a 
distribution indexed by „. For instance, {gd,k) might represent topic proportions in document d; ipz^^^ might 
represent a topic, i.e. a distribution over words; and Xd,n might represent the nth word in the dth document. 

As before, an alternative view of this process is that we draw an individual-specific set of counts from an 
appropriate stochastic process and then generate the appropriate number of data points for each individual. 
We also need to couple the coimts across individuals. This can be achieved by constructing hierarchical models 
involving the NBP. One way to proceed is the following conditional independence hierarchy: 

So ~BP(^,7,i7) 
/d = $^id,fe<5v.. NBP(rd,So), 

k 

where we first draw a random measure Bq from the beta process and then draw multiple times from an NBP 
with base measure given by Bq. Although this conditional independence hierarchy does couple count vectors 
across multiple individuals, it does not have the flexibility of the HDP, which draws individual-specific mixing 
proportions from an underlying set of population-wide mixing proportions and then converts these mixing 
proportions into counts. We can capture this flexibility within an NBP-based framework by simply extending 
the hierarchy by one level: 

Bo^BP{e,j,H) 
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TABLE 1: A comparison of two featural constructions for Bayesian nonparametric clustering. PP indicates a 
Poisson point process draw with the given intensity. 



Beta negative binomial process 


Gamma Poisson process 


iy{db, dtjj) = 76*6-^(1 - 6)"-^ db H{dtp) 

B = J2k ^kS^k 

Afe Gamnia(r, ^^^) 
ik '~ Pois(Afe) 


v{d~g,di:) = eg-'^e-'^a dg H{di:) 

(|fc,V'fc)~PPM'^5,#)) 

G = Y,k 9khk 
ik '~ Pois(5fe) 



Bd" BP(0d,7d,So/So(*)) 
Id = Y.id,kS^u NBP(rd,Bd). 

k 

Since Bq is almost surely an atomic measure, the atoms of each Bd will coincide with those of Bq almost surely. 
The weights associated with these atoms can be viewed as individual-specific feature probability vectors. We 
refer to this prior as the hierarchical beta negative binomial process (HBNBP). 

We also note that it is possible to consider additional levels of structure in which a population is decomposed 
into subpopulations and further decomposed into subsubpopulations and so on, bottoming out in a set of 
individuals. This tree structure can be captured by repeated draws from a set of beta processes at each level 
of the tree, conditioning on the beta process at the next highest level of the tree. Hierarchies of this form have 
previously been explored for beta Bernoulli processes by Thibaux and Jordan (2007). 

5 Connections 

In the previous section we noted that both the beta negative binomial process (BNBP) and the gamma Poisson 
process (PPLP) provide nonparametric models for the count vectors arising in admixture models. In this 
section, we will elucidate some of the deeper connections between these two stochastic processes. We will see 
that understanding these connections can not only inspire new stochastic process constructions but also lead 
to novel inference algorithms. 

We are motivated by Table 1, which indicates a strong parallel between the BNBP and FPLP constructions, 
with the former requiring an additional random stage consisting of a draw from a gamma distribution. Here, 
we use the representation of the negative binomial distribution, i ~ NB(r, 6), as a garrmia mixture of Poisson 
distributions: 6 ~ Gamma(r, (1 — and i ~ Pois(6). However, this table mostly highlights the parallel on 
the level of the likelihood process and therefore on the level of classic, one-dimensional distributions. The 
relations between such distributions are well-studied. 

Noting that many classic, one-dimensional distributions are easily obtained from each other by simple 
change of variables, we aim to find new, analogous transformations in the stochastic process setting. In 
particular, all of our results in this section, which apply to nonparametric Bayesian priors derived from Poisson 
point processes, have direct analogues in the setting of one-dimensional distributions. We start by reviewing 
these known distributional relations. First, consider a beta distributed random variable x ~ Beta(a, 6). Then 
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the variable x/{l — x) has a beta prime distribution with parameters a and 6; specifically, P'{a,h) denotes the 
beta prime distribution with density 

The beta prime distribution can alternatively be derived from a gamma distribution. Namely, if a; Ganima(a, c) 
and y ~ Ganima(6, c) are independent, then x/y P'{a,b). This connection is not the only one between the 
beta and gamma distributions though. Alternatively, let 

X ~ Gamnia(o, c), y ~ Gamma(6, c). (6) 

Then 

a;/(a; + 2/) ~ Beta(a,6). (7) 

In the rest of this section, we present similar results but now for the nonparametric case — the beta process, 
gamma process and a new process we call the beta prime process. The proofs of these results appear in 
Appendix A. 

We start by defining a new completely random measure with nonnegative, real-valued feature weights. 
First, we note that, as for the processes defined in Section 2, there is no deterministic measure. Second, we 
specify that the fixed atoms have distribution 

r?;'S?/3'(^7Pi,^(l-7/>0) 

at locations (u/). Here, 9 > 0, j > 0, {pi)^i, and (u;) are parameters. As usual, while the number of fixed 
atoms L may be countably infinite, it is typically finite. Finally, the ordinary component has Poisson process 
intensity Hard x i^, where 

p{db) =jeb-^{l + b)-'^ db, (8) 

which we note is sigma-finite with finite mean, guaranteeing that the number of atoms generated from the 
ordinary component wiU be countably infinite with finite sum. 

We abbreviate by defining H = J2f=i Pi^ui + Hgrd and say that the resulting CRM B = J2k ^k^i^k ^ draw 
from a beta prime process (BPP) with base distribution H: B BPP(^,7, i?). The name "beta prime process" 
reflects the fact that the imderlying intensity is an improper beta prime distribution as well as the beta prime 
distribution of the fixed atoms. 

With this definition in hand, we can find the stochastic process analogues of the distributional results above 
(with proofs in Appendix A). Just as a beta prime distribution can be derived from a beta random variable, 
we have the following result that a similar transformation of the atom weights of a beta process yields a beta 
prime process. 

Proposition 4. Suppose B = Y^^ bkS^^ ~ BP(6»,7,iI). Then Y.k ih^^i'k ~ BPP(6i, 7, iJ). 
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Just as a beta prime random variable can be derived as the ratio of gamma random variables, we find that 
the atoms of the beta prime process can be constructed as by taking ratios of gamma random variables and 
the atoms of a gamma process. 

Proposition 5. Suppose G = J2k9k^i'k ~ rP(70, c, i?) and tu ~ Gamina(0(l — 7i?({V'fc})), c) independently for 
each k. Then J2k l^^^k ~ BPP{e,^,H). 

And, finally, the analogue to constructing a beta random variable from two gamma random variables is the 
construction of a beta process from a gamma process and an infinite vector of independent garrvma random 
variables. 

Proposition 6. Suppose G = J2k9k^^k ~ rP(7^, c, if) and Tk ~ Gamnia(^(l — ^H{{tpk})),c) independently for 
each k. Then Y.k ~ BP(e,7,if). 

The key to the manipulations above was the Poisson process framework of the ordinary component, which 
allows these easy manipulations on the stochastic process level. In particular, we see that BP itself can be 
derived from TP, and therefore the connection between the BNBP and PPLP is not restricted to just the 
negative binomial and Poisson likelihood. Moreover, besides introducing a further stochastic process in the 
form of the beta prime process, we emphasize that these relations potentially allow us to perform inference 
for a new stochastic process when inference for another, related stochastic process is already known — or to 
have available alternative, potentially faster or better mixing, inference algorithms. 

6 ASYMPTOTICS 

An important component of choosing a Bayesian prior is verifying that its behavior aligns with our beliefs 
about the behavior of the data-generating mechanism. In models of clustering, a particular measure of interest 
is the diversity — the dependence of the number of clusters on the number of data points. In speaking of the 
diversity, we typically assume a finite number of fixed atoms in a process derived from a CRM, so that 
asymptotic behavior is dominated by the ordinary component. 

It has been observed in a variety of different contexts that the number of clusters in a data set grows as a 
power law of the size of the data; that is, the number of clusters is asjnnptotically proportional to the number 
of data points raised to some positive power. Overviews of data examples are provided by Newman (2005) 
and Mitzenmacher (2004). 

The diversity has been characterized for the Dirichlet process (DP) and a two-parameter extension to the 
Dirichlet process known as the Pitman-Yor process (PYP) (Pitman and Yor, 1997), with extra parameter a G (0, 1) 
and concentration parameter 9 > —a. We will see that while the number of clusters generated according to a 
DP grows as a logarithm of the size of the data, the number of clusters generated according to a PYP grows 
as a power of the size of the data. Indeed, the popularity of the Pitman-Yor process — as an alternative prior to 
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the Dirichlet process in the clustering domain — can be attributed to this power-law growth (Goldwater et al., 
2006). In this section, we derive analogous asymptotic results for the BNBP treated as a clustering model. 

We first highlight a subtle difference between our model and a Dirichlet process. For a Dirichlet process, 
the number of data points N is known a priori and fixed. An advantage of our model is that it models 
the number of data points as a random variable and therefore has potentially more predictive power in 
modeling multiple populations. We note that a similar effect can be achieved for the Dirichlet process by using 
the gamma process for feature modeling as described in Section 4 rather than normalizing away the mass that 
determines the number of observations. However, there is no such unnormalized completely random measure 
for the PYP (Pitman and Yor, 1997). We thus treat N as fixed for the DP and PYP, in which case the number 
of clusters K{N) is a function of N. On the other hand, the number of data points N{r) depends on r in the 
case of the BNBP, and the number of clusters K{r) does as well. We also define Kj{N) to be the number of 
clusters with exactly j elements in the case of the DP and PYP, and we define Kj (r) to be the number of 
clusters with exactly j elements in the BNBP case. 

For the DP and PYP, K{N) and Kj{N) are random even though N is fixed, so it will be useful to also 
define their expectations: 

$(iV) ^ E[K{N)], $j(iV) ^ E[Kj{N)]. (9) 

In the BNBP and 3BNBP cases, all of K{r), Kj{r), and N{r) are random. So we further define 

^r)^E[K{r)], ^j{r) ^E[Kj{r)], ^ E[iV(r)]. (10) 

We summarize the results that we establish in this section in Table 2, where we also include comparisons 
to existing results for the DP and PYP. The full statements of our results, from which the table is derived, 
can be found in Appendix B, and proofs are given in Appendix C. 

The table shows, for example, that for the DP, ^{N) ^ 9 log(iV) as -> cx), and, for the BNBP, (r) 
as r -> 00 (i.e., constant in r). The DP result can be found in Korwar and Hollander (1973); both DP and 
PYP results can be found in Pitman and Yor (1997, Eq. 3.24 on p. 69 and Eq. 3.47 on p. 73). Note that the 
expected counts of clusters of size j results are asymptotic expansions in terms of r for fixed j and should 
not be interpreted as asymptotic expansions in terms of j. 

We conclude that, just as for the Dirichlet process, the BNBP can achieve both logarithmic cluster number 
growth in the basic model and power law cluster number growth in the expanded, three-parameter model. 

7 Simulation 

Our theoretical results in Section 6 are supported by simulation results, summarized in Figure 1; in particular, 
our simulation corroborated the existence of power laws in the three-parameter beta process case examined in 
Section 6. The simulation was performed as follows. For values of the negative binomial parameter r evenly 
spaced between 1 and 1,001, we generated beta process weights according to a beta process (or three-parameter 
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TABLE 2: Let N be the number of data points when this number is fixed and ^(r) be the expected number of 
data points when N is random. Let $(-/V), ^j{N), $(r), and $j(r) be the expected number of clusters under 
various scenarios and defined as in Eqs. 9 and 10. The upper part of the table gives the asymptotic behavior 
of $ up to a multiplicative constant, and the bottom part of the table gives the multiplicative constants. For 
the DP, > 0. For the PYP, a e (0, 1) and > -a. For the BNBP, > 1. For the 3BNBP, a e (0, 1) and 

e>i-a. 



Process Expected number of clusters Expected number of clusters of size j 



Function of N or ^(r) 



DP 
PYP 


log(iV) 




BNBP 


log(e(r)) 


1 


3BNBP 


mr 


mr 


Constants 


DP 







PYP 
BNBP 
3BNBP 


r(e+i) 

ar{e+a) 

76* 

7!-° r(e+i) /e+a-i\<^ 
a r{e+a) \ ) 


„j-a r(0+i) ru-a) (e+a-lY' 
1 r(i-a)r(e+a) ru+i) \ e J 



beta process) stick-breaking representation (Paisley et al., 2010; Broderick et al., 2012). For each of the resulting 
atoms, we simulated negative binomial draws to arrive at a sample from a BNBP. For each such BNBP, we 
can count the resulting total number of data points N and total number of clusters K. Thus, each r gives us 
an (r, N, K) triple. 

In the simulation, we set the mass parameter 7 = 3. We set the concentration parameter ^ = 3; in particular, 
we note that the analysis in Section 6 implies that we should always have ^ > 1. Finally, we ran the simulation 
for both the a = case, where we expect no power law behavior, and the a = 0.5 case, where we do expect 
power law behavior. The results are shown in Figure 1. Is this figure, we scatter plot the (r, K) tuples from 
the generated (r, N, K) triples on the left and plot the {N, K) tuples on the right. 

In the left plot, the upper black points represent the simulation with a ~ 0.5, and the lower blue data 
points represent the a — Q case. The lower red line illustrates the theoretical result corresponding to the a = 
case (Lemma 10 in Appendix B), and we can see that the anticipated logarithmic growth behavior agrees 
with our simulation. The upper red line illustrates the theoretical result for the a = 0.5 case (Lemma 11 in 
Appendix B). The agreement between simulation and theory here demonstrates that, in contrast to the a = 
case, the a = 0.5 case exhibits power law growth in the number of cluster K as a function of the negative 
binomial parameter r. 

Our simulations also bear out that the expectation of the random number of data points N increases linearly 
with r (Lemmas 8 and 9 in Appendix B). We see, then, on the right side of Figure 1 the behavior of the number 
of clusters K now plotted as a function of N. As expected given the asymptotics of the expected value of N, 
the behavior in the right plot largely mirrors the behavior in the left plot. Just as in the left plot, the lower 
red line (Theorem 14 in Appendix B) shows the anticipated logarithmic growth of K and N when a = 0. And 
the upper red line (Theorem 15 in Appendix B) shows the anticipated power law growth of K and N when 
a = 0.5. 
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10° 10^ 10^ 10^ 10° 10^ lO'' 

Negative binomial parameter r Number of data points N 

Fig. 1: For each r evenly spaced between 1 and 1,001, we simulate (random) values of the number of data 
points N and number of clusters K from the BNBP and 3BNBP. In both plots, we have mass parameter 

7 = 3 and concentration parameter — "i. On the left, we see the number of clusters K as a fimction of the 
negative binomial parameter r (see Lemma 10 and Lemma 11 in Appendix B); on the right, we see the number 
of clusters i^T as a function of the (random) number of data points N (see Theorem 14 and Theorem 15 in 
Appendix B). In both plots, the upper black points show simulation results for the case a — 0.5, and the lower 
blue points show a = 0. Red lines indicate the theoretical asymptotic mean behavior we expect from Section 6. 

We can see the parallels with the DP and PYP here. Clusters generated from the Dirichlet process (i.e.. 
Pitman- Yor process with a = 0) exhibit logarithmic growth of the expected number of clusters K as the 
(deterministic) number of data points N grows. And clusters generated from the Pitman- Yor process with 
a E (0, 1) exhibit power law behavior in the expectation of iiT as a function of (fixed) N. So too do we see 
that the BNBP, when applied to clustering problems, yields asymptotic growth similar to the DP and that 
the 3BNBP yields asymptotic growth similar to the PYP. 

8 Posterior inference 

In this section we present posterior inference algorithms for the HBNBP. We focus on the setting in which, 
for each individual d, there is an associated exchangeable sequence of observations {xd,n)n=i- seek to 
infer both the admixture component responsible for each observation and the parameter ipk associated with 
each component. Hereafter, we let Zd,n denote the unknown component index associated with Xd,n, so that 

Under the HBNBP admixture model introduced in Section 4, the posterior over component indices and 
parameters has the form 

p(z.,.,'0. I x.,.,e) cxp(z.,.,'0.,bo,.,b.,. I x.,.,e), 

where Q = {F, H,'-fo,9o,j.,0.,r.) is the collection of all fixed hyperparameters. As is the case with HDP 
admixtures (Teh et al., 2006) and earlier hierarchical beta process featural models (Thibaux and Jordan, 2007), 
the posterior of the HBNBP admixture cannot be obtained in analytical form due to complex couplings in 
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the marginal p(x.^. | 9). We therefore develop Gibbs sampling algorithms (Geman and Geman, 1984) to draw 
samples of the relevant latent variables from their joint posterior. 

A challenging aspect of inference in the nonparametric setting is the coimtable infinitude of component 
parameters and the countably infinite support of the component indices. We develop two sampling algorithms 
that cope with this issue in different ways. In Section 8.1, we use slice sampling to control the number of 
components that need be considered on a given round of sampling and thereby derive an exact Gibbs sampler 
for posterior inference under the HBNBP admixture model. In Section 8.2, we describe an efficient alternative 
sampler that makes use of a finite approximation to the beta process. Throughout we assume that the base 
measure H is continuous. We note that neither procedure requires conjugacy between the base distribution 
H and the data-generating distribution F. 

8.1 Exact Gibbs slice sampler 

Slice sampling (Damien et al., 1999; Neal, 2003) has been successfully employed in several Bayesian nonpara- 
metric contexts, including Dirichlet process mixture modeling (Walker, 2007; Papaspiliopoulos, 2008; Kalli 
et al., 2011) and beta process feature modeling (Teh et al., 2007). The key to its success lies in the introduction 
of one or more auxiliary variables that serve as adaptive truncation levels for an infinite sum representation 
of the stochastic process. 

This adaptive truncation procedure proceeds as follows. For each observation associated with individual d, 
we introduce an auxiliary variable Ud,n with conditional distribution 



where (Cd,/s)^i is a fixed positive sequence with limfe^oo ^d,k = 0. To sample the component indices, we recall 
that a negative binomial draw id,k ~ NB(r(i, hd,k) niay be represented as a gamma-Poisson mixture: 



We first sample \d,k from its full conditional. By gamma-Poisson conjugacy, this has the simple form 



We next note that, given Ad, and the total number of observations associated with individual d, the 
cluster sizes id^k may be constructed by sampling each Zd^n independently from \d,- / ^k'^d.k and setting 
id,k = J2n^i^d<n — Hence, conditioned on the number of data points Nd, the component parameters ^k, 
the auxiliary variables Xd,k, and the slice-sampling variable Ud,n, we sample the index Zd,n from a discrete 
distribution with 



Ud,n 




id,k ~ Pois(Ad,fc). 



Ad.fc ~ Gamma (r^ + id,k, '^/bd,k) ■ 



^Zd,n = 



k) oc F{dxd,n I i^k) 



I(Md,n < £,d,k) 
id,k 



Xd,k 
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SO that only the finite set of component indices {k : ^d,k > Ud,n} need be considered when sampling Zd,n- 

Let Kd = max{A; : 3n s.t. ^d,k > Ud,n} and K = maxd Kd- Then, on a given roimd of sampling, we need only 
explicitly represent \d,k and hd,k for k < Kd and and 6o,fc for k < K. The simple Gibbs conditionals for 
hd,k and il)k can be foimd in Appendix E.l. To sample the shared beta process weights 6o,/c/ we leverage the 
size-biased construction of the beta process introduced by Thibaux and Jordan (2007): 

OO Cm 

m=0 i=l 

where 

Cm '~ Pols ( ) , 6o,m,i '~ Beta(l, 6*0 + m), and Vm,i,- ~ H, 

\6o+mJ 

and we develop a Gibbs slice sampler for generating samples from its posterior. The details are deferred to 
Appendix E.l. 

8.2 Finite approximation Gibbs sampler 

An alternative to the size-biased construction of Bq is a finite approximation to the beta process with a fixed 
number of components, K: 

6o,fe''^'Bcta(eo7o/-?^,^o(l-7o/^)), ^k'^ H, ke{l,...,K}. (11) 

It is known that the distribution of J2k=i ^0,/c^Vfc converges to BP(^, 7, H) as the level of approximation K ^ 00 
(c.f. the proof of Theorem 3.1 by Hjort (1990) with the choice Ao{t) = 7), and the prespecified component coxmt 
often leads to computational savings in practice. The detailed conditionals of the finite approximation Gibbs 
sampler can be foimd in Appendix E.3. 

9 Image Segmentation and Object Recognition 

In the next two sections, we show how the HBNBP admixture model can be used as a practical building block 
for more complex supervised and unsupervised inferential tasks. 

Two problems of wide iaterest in the computer vision commimity are image segmentation, dividing an image 
into its distinct, semantically meaningful regions, and object recognition, labeling the regions of images according 
to their semantic object classes. Solutions to these problems are at the core of applications like content-based 
image retrieval, video surveying, and object tracking. Here we wiU take an admixture modeling approach to 
jointly recognizing and localizing objects within images (Cao and Li, 2007; RusseU et al., 2006; Sivic et al., 2005; 
Verbeek and Triggs, 2007). Each individual d is an knage comprised of Nd image patches (observations), and 
each patch „ is assumed to be generated by an unknown object class (a latent component of the admixture). 
Given a series of training images with image patches labeled, the problem of recognizing and localizing objects 
in a new image reduces to inferring the latent class associated with each new image patch. We will tackle this 
inferential task with the HBNBP admixture model and the finite approximation Gibbs sampler of Section 8 
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and compare its performance with that of a more standard model of admixture. Latent Dirichlet Allocation 
(LDA) (Blei et al, 2003). 

9.1 Representing an Image Patch 

We will represent each image patch as a vector of visual descriptors drawn from multiple modalities. Verbeek 
and Triggs (2007) suggest three complementary modalities: texture, hue, and location. Here, we introduce a 
fourth: opponent angle. To describe hue, we use the robust hue descriptor of Van De Weijer and Schmid (2006), 
which grants invariance to illuminant variations, lighting geometry, and specularities. For texture description 
we use "dense SIFT" features (Lowe, 2004; Dalai and Triggs, 2005), histograms of oriented gradients computed 
not at local keypoints but rather at a single scale over each patch. To describe coarse location, we cover each 
image with a regular c x c grid of cells (for a total of = (? cells) and assign each patch the index of 
the covering cell. The opponent angle descriptor of Van De Weijer and Schmid (2006) captures a second 
characterization of image patch color. These features are invariant to specularities, iUimfiinant variations, and 
diffuse lighting conditions. 

To build a discrete visual vocabulary from these raw descriptors, we vector quantize the dense SIFT, hue, 
and opponent angle descriptors using k-means, producing ^^'f*, ^hue^ yopp clusters respectively. Finally, 
we form the observation associated with a patch by concatenating the four modality components into a single 
vector, Xd,n = ('^d ^d"n ' -^rf «' ■'^d^n )■ Verbcck and Triggs (2007), we assume that the descriptors from 

disparate modalities are conditionally independent given the latent object class of the patch. Hence, we define 
our data generating distribution and our base distribution over parameters i/)^ = (^^^SV'fc"'') V'fe"^)V'fc''^) via 

^™ '~ Dirichlet(r7lym) for m e {sift, hue, loo, opp} 

^'d,n I ^d,n-, tp. '~ Mult(l, V"^ „) for m G {sift, hue, loc, opp} 

for a h5^erparameter r] and lym a ^^-dimensional vector of ones. 

9.2 Experimental Setup 

We use the Microsoft Research Cambridge pixel-wise labeled image database vl* in our experiments. The data 
set consists of 240 images, each of size 213 x 320 pixels. Each image has an associated pixel-wise groimd truth 
labeling, with each pixel labeled as belonging to one of 13 semantic classes or to the void class. Pixels have a 
groxmd truth label of void when they do not belong to any semantic class or when they lie on the botmdaries 
between classes in an image. The dataset provider notes that there are insufficiently many instances of horse, 
mountain, sheep, or water to learn these classes, so, as in Verbeek and Triggs (2007), we treat these ground truth 
labels as void as well. Thus, our general task is to learn and segment the remaining 9 semantic object classes. 

From each image, we extract 20 x 20 pixel patches spaced at 10 pixel intervals across the image. We choose 
the visual vocabulary sizes (V^"'*, V^"", V°pp) = (1000, 100, 100, 100) and fbc the hyperparameter r? = 0.1. 

4. http://research.irucrosoft.com/vision/cambridge/recognition/ 
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Fig. 2: MSRC-vl test image segmentations inferred by the HBNBP admixture model (best viewed in color). 

As in Verbeek and Triggs (2007), we assign each patch a ground truth label zd.n representing the most frequent 
pixel label within the patch. When performing posterior inference, we divide the dataset into training and test 
images. We allow the inference algorithm to observe the labels of the training image patches, and we evaluate 
the algorithm's ability to correctly infer the label associated with each test image patch. 

For inference under the HBNBP admixture model, we employ the finite approximation Gibbs sampler of 
Section 8.2 with hyperparameters (70, ^^o, 7d, ^d) — (3,3,1,10) for all images d and set according to the 
heuristic = -/Vd(0o ^ l)/(^'o7o)- arrive at this heuristic by matching to its expectation under a non- 
hierarchical BNBP model and solving for r^^: 



E[Nd] = raE [^^^^ 6^,^/(1 - bd,k)\ = 70^0/(^0 - 1 



We draw 2000 samples and, for each test patch, predict the label with the highest probability under the 
final sample. We compare HBNBP performance with that of LDA using the standard variational inference 
algorithm of Blei et al. (2003) and maximum a posteriori prediction of patch labels. For each model, we set 
K = 10, allowing for the 9 semantic classes plus void, and, following Verbeek and Triggs (2007), we ensure 
that the void class remains generic by fixing t/j^o = ijhr: ■ ■ ■ , y^) for each modality m. 



9.3 Results 

Figure 2 displays sample test image segmentations obtained using the HBNBP admixture model. Each pixel 
is given the predicted label of its closest patch center. Test patch classification accuracies for the HBNBP 
admixture model and LDA are reported in Tables 3a and 3b respectively. All results are averaged over twenty 
randomly generated 90% training / 10% test divisions of the data set. The two methods perform comparably, 
with the HBNBP admixture model outperforming LDA in the prediction of every object class save building. 
Indeed, the mean object class accuracy is 0.78 for the HBNBP model versus 0.76 for LDA, showing that the 
HBNBP provides a viable alternative to more classical approaches to admixture. 
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TABLE 3: Confusion matrices for patch-level image segmentation and object recognition on the MSRC-vl 
database. We report test image patch inference accuracy averaged over twenty randomly generated 90% 
training / 10% test divisions. 

(a) HBNBP Confusion Matrix 



Predicted Class Label 





building 


grass 


tree 


cow 


sky 


aeroplane 


tace 


car 


bicycle 


building 


0.65 1 


0.01 


0.05 


0.00 


0.03 


0.09 


0.01 


0.03 


0.10 


grass 


0.00 


0.89 


0.06 


0.02 


0.00 


0.01 


0.00 


0.00 


0.00 


tree 


0.01 


0.08 


0.75 


0.01 


0.04 


0.03 


0.00 


0.00 


0.07 


cow 


0.01 


0.10 


0.04 


0.72 


0.00 


0.00 


0.05 


0.01 


0.01 


sky 


0.04 


0.00 


0.01 


0.00 


0.92 


0.01 


0.00 


0.00 


0.00 


aeroplane 


0.09 


0.04 


0.01 


0.00 


0.02 


0.80 


0.00 


0.03 


0.00 


face 


0.04 


0.00 


0.01 


0.04 


0.00 


0.00 


0.84 


0.00 


0.00 


car 


0.20 


0.00 


0.01 


0.00 


0.01 


0.01 


0.00 


0.02 


bicycle 




0.00 


0.04 


0.00 


0.00 


0.00 


0.01 


o!o2^B^H 



(b) LDA Confusion Matrix 
Predicted Groups 



< 





building 


grass 


tree 


cow 


sky 


aeroplane 


tace 


car 


bicycle 


building 


0.69 


0.01 


0.04 


0.01 


0.03 


0.07 


0.01 


0.03 


0.08 


grass 


0.00 




0.05 


0.02 


0.00 


0.01 


0.00 


0.00 


0.00 


tree 


0.02 


0.08 


0.75 


0.01 


0.04 


0.02 


0.00 


0.00 


0.05 


cow 


0.00 


0.10 


0.03 


0.70 


0.00 


0.00 


0.05 


0.01 


0.01 


sky 


0.05 


0.00 


0.02 


0.00 


0.91 


0.01 


0.00 


0.00 


0.00 


aeroplane 


0.12 


0.04 


0.01 


0.00 


0.02 




0.00 


0.03 


0.00 


face 


0.04 


0.00 


0.01 


0.05 


0.00 


0.00 


0.80 


0.00 


0.00 


car 


0.19 


0.00 


0.01 


0.00 


0.01 


0.01 


0.00 


Km 


0.03 


bicycle 


0.19 


0.00 


0.04 


0.01 


0.00 


0.00 


0.00 


0.02 


0.68 



TABLE 4: Sensitivity of HBNBP admixture model to hyperparameter specification for joint image segmentation 
and object recognition on the MSRC-vl database. Each hyperparameter is varied across the specified range 
while the remaining parameters are held fixed to the default values reported in Section 9.2. We report test 
patch inference acciiracy averaged across object classes and over twenty randomly generated 90% training / 
10% test divisions. 



Hyperparameter 


Parameter range 


Minimum accuracy 


Maximum accuracy 


7o 


[0.3, 30] 


0.782 


0.783 


^0 


[1.5,30] 


0.782 


0.783 


V 


[2 X 10-1^, 1] 


0.777 


0.785 



9.4 Parameter Sensitivity 

To test the sensitivity of the HBNBP admixture model to misspecification of the mass, concentration, and 
likelihood hyperparameters, we measure the fluctuation in test set performance as each h5^erparameter 
deviates from its default value (with the remainder held fixed). The results of this study are summarized 
in Table 4. We find that the HBNBP model is rather robust to changes in the hj^erparameters and maintains 
nearly constant predictive performance, even as the parameters vary over several orders of magnitude. 

10 Document Topic Modeling 

To further validate the expressiveness of the HBNBP admixture model and the effectiveness of the posterior 
inference procedures discussed in Section 8, we consider the task of document topic modeling, in which each 
individual c? is a document containing Nd observations (words) and each word a;^ „ belongs to a vocabulary 
of size V. The topic modeling framework is an instance of admixture modeling in which we assume that each 
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word of each document is generated from a latent admixture component or topic, and our goal is to infer the 
topic underlying each word. 

In our experiments, we let Hard, the ^ dimension of the ordinary component intensity measure, be a Dirichlet 
distribution with parameter r/l for r? = 0.1 and 1 a V-dimensional vector of ones and let F{ipk) be Mult(l, ipk)- 
We again fix the global and document-specific mass and concentration parameters as (70, Oq, jd, Od) = (3, 3, 1, 10) 
for all d and set the document-specific negative binomial shape parameter according to the heuristic 
rd = Nd{Oo — l)/(6'o7o)- When applying the exact Gibbs slice sampler, we let the slice sampling decay sequence 
follow the same pattern across all documents: ^d,k = l-S"*^. 

10.1 Worldwide Incidents Tracking System 

We report results on the Worldwide Incidents Tracking System (WITS) data set.^ This data set consists of reports 
on 79,754 terrorist attacks from the years 2004 through 2010. Each event contains a written surrmiary of the 
incident, location information, victim statistics, and various binary fields such as "assassination," "lED," and 
"suicide." We transformed each incident into a text document by concatenating the summary and location 
fields and then adding further words to account for other, categorical fields: e.g., an incident with seven 
hostages would have the word "hostage" added to the document seven times. We used a vocabulary size of 
1/ = 1,048 words. 

10.1.0.1 Perpetrator Identification: Our experiment assesses the ability of the HBNBP admixture model 
to discriminate among incidents perpetrated by different organizations. We first grouped documents according 
to the organization claiming responsibility for the reported incident. We considered 5,390 claimed documents 
in total distributed across the ten organizations listed in Table 5. We removed all orgarvization identifiers from 
all documents and randomly set aside 10% of the documents in each group as test data. Next, for each group, 
we trained an independent, organization-specific HBNBP model on the remaining documents in that group by 
drawing 10,000 MCMC samples. We proceeded to classify each test document by measuring the likelihood of 
the document imder each trained HBNBP model and assigning the label associated with the largest likelihood. 
The resulting confusion matrix across the ten candidate organizations is displayed in Table 6a. Results are 
reported for the exact Gibbs slice sampler; performance under the firute approximation sampler is nearly 
identical. 

For comparison, we carried out the same experiment using the more standard HDP admixture model in 
place of the HBNBP. For posterior inference, we used the HDP block sampler code of Yee Whye Teh^ and 
initialized the sampler with 100 topics and topic hj^erparameter rj = 0.1 (all remaining parameters were set 
to their default values). For each organization, we drew 250,000 MCMC samples and kept every twenty-fifth 
sample for evaluation. The confusion matrix obtained through HDP modeling is displayed in Table 6b. We see 
that, overall, HBNBP modeling leads to more accurate identification of perpetrators than its HDP coimterpart. 

5. https://wits.nctc.gov 

6. http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-rl.tgz 
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TABLE 5: The number of incidents claimed by each organization in the WITS perpetrator identification 
experiment. 



Group ID 


Perpetrator 


# Claimed Incidents 


1 


taliban 


2647 


2 


al-aqsa 


417 


3 


fare 


76 


A 

4 


izz al-din al-qassam 


4/o 


5 


hizballah 


89 


6 


al-shabaab al-islamiya 


426 


7 


al-quds 


505 


8 


abu ali mustafa 


249 


9 


al-nasser salah al-din 


212 


10 


commimist party of nepal (maoist) 


291 



TABLE 6: Confusion matrices for WITS perpetrator identification. See Table 5 for the organization names 
matching each group ID. 

(a) HBNBP Confusion Matrix 

Predicted Groups 





1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


1 


1.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


2 


0.00 


0.38 


0.00 


0.02 


0.00 


0.00 


0.29 


0.29 


0.02 


0.00 


3 


0.00 


0.00 


1.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


4 


0.00 


0.00 


0.00 


0.54 


0.00 


0.00 


0.15 


0.27 


0.04 


0.00 


5 


0.11 


0.33 


0.00 


0.11 


0.44 


0.00 


0.00 


0.00 


0.00 


0.00 


6 


0.02 


0.00 


0.00 


0.00 


0.00 


0.98 


0.00 


0.00 


0.00 


0.00 


7 


0.00 


0.10 


0.00 


0.06 


0.02 


0.00 




0.30 


0.04 


0.00 


8 


0.00 


0.04 


0.00 


0.00 


0.00 


0.00 


0.16 


0.76 


0.04 


0.00 


9 


0.00 


0.10 


0.00 


0.05 


0.10 


0.00 


0.29 


0.43 


0.05 


0.00 


10 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 





(b) HDP Confusion Matrix 



Predicted Groups 





1 


2 


3 


4 


5 


6 


7 


d 


9 


10 


1 


, 0.46 


0.00 


0.26 


0.00 


0.03 


0.23 


0.00 


0.00 


0.00 


0.01 


2 


0.00 


0.31 


0.02 


0.02 


0.00 


0.00 


0.29 


0.36 


0.00 


0.00 


3 


0.00 


0.00 


1.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


4 


0.00 


0.00 


0.00 




0.04 


0.00 


0.06 


0.31 


0.06 


0.00 


5 


0.11 


0.00 


0.00 


0.00 


0.44 


0.00 


0.11 


0.11 


0.11 


0.11 


6 


0.00 


0.00 


0.00 


0.00 


0.00 


1.00 


0.00 


0.00 


0.00 


0.00 


7 


0.00 


0.10 


0.00 


0.04 


0.00 


0.00 


0.38 


0.42 


0.06 


0.00 


8 


0.00 


0.04 


0.00 


0.00 


0.00 


0.00 


0.08 


0.84 


0.04 


0.00 


9 


0.00 


0.05 


0.00 


0.10 


0.00 


0.00 


0.24 


0.62 


0.00 


0.00 


10 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


1.00 



Most notably, the HDP wrongly attributes more than half of all documents from group 1 (taliban) to group 
3 (fare) or group 6 (al-shabaab al-islamiya). We hypothesize that the HBNBP's superior discriminative power 
stems from its ability to distinguish between documents both on the basis of word frequency and on the basis 
of document length. 

We would expect the HBNBP to have greatest difficulty discriminating among perpetrators when both word 
usage frequencies and document length distributions are similar across groups. To evaluate the extent to which 
this occurs in our perpetrator identification experiment, for each orgarvization, we plotted the density histogram 
of document lengths in Figure 3a and the heat map displaying word usage frequency across all associated 
documents in Figure 3b. We find that the word frequency patterns are nearly identical across groups 2, 7, 8, 
and 9 (al-aqsa, al-quds, abu ali mustafa, and al-nasser salah al-din, respectively) and that the document length 
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TABLE 7: The ten most probable words from the most probable topic in the final MCMC sample of each group 
in the WITS perpetrator identification experiment. The topic probability is given in parentheses. See Table 5 
for the organization names matching each group ID. 

HBNBP: Top topic per organization 

group 1 (0.29) afghanistan, assailants, claimed, responsibility, armedattack, fired, police, victims, armed, upon 

group 2 (0.77) Israel, assailants, armedattack, responsibility, fired, claimed, district, causing, southern, damage 

group 3 (0.95) Colombia, victims, facility, wounded, armed, claimed, forces, revolutionary, responsibility, assailants 

group 4 (0.87) Israel, fired, responsibility, claimed, armedattack, causing, injuries, district, southern, assailants 

group 5 (0.95) victims, wounded, facility, Israel, responsibility, claimed, armedattack, fired, rockets, katyusha 

group 6 (0.54) wounded, victims, Somalia, civilians, wounding, facility, killing, mortars, armedattack, several 

group 7 (0.83) Israel, district, southern, responsibility, claimed, fired, armedattack, assailants, causing, injuries 

group 8 (0.94) Israel, district, southern, armedattack, claimed, fired, responsibility, assailants, causing, injuries 

group 9 (0.88) Israel, district, southern, fired, responsibility, claimed, armedattack, assailants, causing, injuries 

group 10 (0.80) nepal, victims, hostage, assailants, party, commimist, claimed, front, maoist/united, responsibility 
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(a) Density histograms of document lengths. 
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(b) Heat map of word frequencies for the 200 most common words across all documents (best viewed in color). 

Fig. 3: Document length distributions and word frequencies for each organization in the WITS perpertrator 
identification experiment. 



distributions of these four groups are all well aligned. As expected, the majority of classification errors made 
by our HBNBP models result from misattribution among these same four groups. The same group similarity 
structure is evidenced in a display of the ten most probable words from the most probable HBNBP topic 
for each group. Table 7. There, we also find an intuitive summary of the salient regional and methodological 
vocabulary associated with each organization. 
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11 Conclusions 

Motivated by problems of admixture, in which individuals are represented multiple times in multiple latent 
classes, we introduced the negative binomial process, an infinite-dimensional prior for vectors of counts. We 
developed new nonparametric admixture models based on the NBP and its conjugate prior, the beta process, 
and characterized the relationship between the BNBP and preexisting models for admixture. We also analyzed 
the asymptotics of our new priors, derived MCMC procediires for posterior inference, and demonstrated the 
effectiveness of our models in the domains of image segmentation and document analysis. 

There are many other problem domains in which latent vectors of coimts provide a natural modeling 
framework and where we believe that the HBNBP can prove useful. These include the computer vision task 
of multiple object recognition, where one aims to discover which and how many objects are present in a given 
image (Titsias, 2008), and the problem of modeling copy number variation in genomic regions, where one seeks 
to infer the imderlying events responsible for large repetitions or deletions in segments of DNA (Chen et al., 
2011). 
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Appendix A 

Proofs for Section 5 

Proof of Proposition 4: First, consider the ordinary component of a beta process. The Mapping Theorem of 
Kingman (1993) tells us that if the collection of tuples {ipk,bk) come from a Poisson process with intensity 
Hord X ^beta, whete Ubeta IS the beta process intensity of Eq. 2, then the collection of tuples {ipk, bk/{l — bk)) 
are draws from a Poisson process with intensity Hord x where we apply a change of variables to find: 

^ ' \l + b) \ 1 + bJ (1 + 6)2 
u{db) = 'yeb-\l + b)-^ db, 

which matches Eq. 8. 

For any particular atom where bk ~ Beta,{9^pk,0{l — 7Pfe)) and pk = H{{ipk}) > 0, we simply quote the 
well-known, one-dimensional change of variables bk/{l — bk) P'{6^pk,0{l — ^pk))- 

Since there is no determirvistic component, we have considered all components of the completely random 
measure. □ 



Proof of Proposition 5: We again start with the ordinary component of a completely random measure. In 
particular, we assume the collection of tuples {tpk,9k) is generated according to a Poisson process with intensity 
Hord X '^gamma, where Vgamma IS the gamma process intensity of Eq. 4. 

Consider a random variable Tfc ~ Gamma(0, c) associated with each such tuple. Then 1/rfe ~ IG(0, c). 
We consider a marked Poisson process with mark bk = gk/rk at tuple {ipk,gk) of the original process. By 
the scaling property of the inverse gamma distribution, we note bk ^ lG{9,cgk) given g^. So the Marking 
Theorem (Kingman, 1993) implies that the collection of tuples {ipk,bk) is itself a draw from a Poisson point 
process with intensity Hord x i^, where 

u{db) = J p(b\ e, c, g)u{dg) dg 

= dbj ^{cg)%-'-'exp{-cg/b)j9g-'exp{-cg)dg 

= lOc'^^b-'^-'db J f-^ exp(-ffc(l + b)/b) dg 

= ^^c^ J_6-e-ir(^) ( — db 
T{e) ^'\c{l + b)) 

e 



-fOb-^ (l + Pj db, 



which matches the beta prime process ordinary component intensity of Eq. 8. 

For any particular atom of the gamma process, gk ~ Gamma(^7Pfe, c) with pk = H{{'tpk}) > 0, it is well 
known that gk/rk has the p' {6^pk,0{l — 'ypk)) distribution, as desired. 
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There is no deterministic component of the gamma process. □ 

Proof of Proposition 6: Before proceeding to prove Proposition 6 in the maimer of the proofs of Propositions 4 
and 5 above, we first note that Proposition 5 can be derived from Proposition 5 and an inverse change of 
variables from that in Proposition 6. 

Taking the same direct route of proof as above, though, we begin with the ordinary component of the gamma 
process so that the collection of tuples {ipk,gk) is generated according to a Poisson process with intensity 
Hord X '^gamrna, where lygamma IS the gamma proccss intensity of Eq. 4. The Marking Theorem (Kingman, 1993) 
tells us that the marked Poisson process with points (V'fe, ^/c, ^fe) has intensity Hord x i^, where 

v{dg,dT) = 76'5-^e-^s(r(6l))-V^-^exp(-cr) dg dr. 

Now consider the change of variables u = g/{g + t),v — g + t. Then the Poisson point process with points 
{tpk,gk/{gk + Tk),gk + Tk) has intensity Hord x v, where 

v{di), du, dv) = {T{0))-'^-fec\-^{l - uf-'^v^-^e-'''' du dv. 

So the Poisson point process with points {ilJk,gk/{gk + Tk)) has intensity Hgrd x i^, with 

v{dil),du) = I fj,{dip,du,dv) 

J V 

= f {T{e))-^-f0c\-\l - uf-^'^-^e-^" du dv 

J V 

= {T{e))-^^ec^u-\i - u)^-^T{e)c-^ du 

= ^eu-\l - u)^-'^ du, 

which is the known beta process intensity. 
In the discrete case with H{{tpk}) = pk > 0, we have by construction 

gk ~ Gamma(^7pfe,c) 

and 

Tfe ~ Gamma(^(l - 7Pfe), c). 

From classic finite distributional results, we have 

— ~ Beta(6'7Pfe, 0{1 - jpk)), 
Tk + gk 

exactly as in the case of the beta process. 
As the gamma process and beta process each have no deterministic components, this completes the proof. □ 
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Appendix B 

Full results for Section 6 

In order to fill in Table 2, we establish how the expected number of data points, ^(r), grows asymptotically 
with r in the BNBP case (in Lemma 8) and the 3BNBP case (in Lemma 9). We begin by showing that the 
expected number of data points is infinite for the concentration parameter range ^ < 1 — a in both the BNBP 
(a = 0) and 3BNBP models. 

Lemma 7. Assume that the discount parameter for three-parameter beta process satisfies a G [0, 1) (the beta process is 
the special case when a — 0), the concentration parameter satisfies 9 < 1 — a, and the mass parameter satisfies 7 > 0. 
Then the expected number of data points, ^(r) = E[X]fe ik]f f^om a BNBP or 3BNBP , as appropriate, is infinite. 

Lemma 8. Assume that the concentration parameter for the beta process satisfies 6 > 1 and the mass parameter satisifies 
7 > 0. Then the expected number of data points ^(r) = i^] from a BNBP has asymptotic growth 

Lemma 9. Assume that a three-parameter beta process has discount parameter a e (0, 1) and concentration parameter 
B>\ — a. Then the expected number of data points ^(r) = ]E[^j(. ife] from a 3BNBP has asymptotic growth 

Q 

^{r) ~ 77— 7r, r 00. 

u -\- cc — 1 

Next, we establish how the expected number of clusters, $(r), grows asymptotically as r ^ 00 in the BNBP 
case (in Lemma 10) and in the 3BNBP case (in Lemma 11). 

Lemma 10. Let 9 > 0. Then the expected number of clusters $(r) = ^[2lk ^{^fc > 0}] fr^^ ^ BNBP has asymptotic 
growth 

$(r)~7^1ogr, r — > 00. 

Lemma 11. Consider a three-parameter beta process. Let the discount parameter satisfy a > and the concentration 
parameter satisfy 9 > —a. Then the number of clusters K{r) J^^. l{ik > 0}from a 3BNBP has almost sure asymptotic 
growth 

a I [9 -\- a) 

We are also interested in how the expected number of clusters of size j, ^j{r), grows as r — >^ 00. To that 
end, we establish this asymptotic growth in the BNBP case in Lemma 12 and in the 3BNBP case in Lemma 13 
below. 

Lemma 12. Let 9 > 0. Then the expected number of clusters of size j, ^j{r) — K[J2k ^{^fe = j}], from a BNBP has 
asymptotic growth 

$j(r) ~ "y9j~^, r — >■ 00. 
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That is, the number is asymptotically constant in r. 

Lemma 13. Let 9 > -a and a e (0, 1). Then the expected number of clusters of size j, ^j{r) = EEfe l{«/c = j}], 
from a 3BNBP has asymptotic growth 



r(l + 9) T{] - a) 



r — )• oo. 



r(l - a)T{9 + a) T{] + 1) 

Finally, we wish to combine these results to establish asymptotic results for the diversity, i.e., the expected 
number of clusters (or clusters of size j) as the expected number of data points varies. We find the asymptotic 
growth in the niimber of clusters for the BNBP in Theorem 14 and for the 3BNBP in Theorem 15. We find 
the asymptotic growth in the number of clusters of size j for the BNBP (in fact, the result has already been 
shown in Lemma 12) and for the 3BNBP in Theorem 16. 

Theorem 14. Let 6 > 1. Then the expected number of clusters $ grows asymptotically as the log of the expected number 
of data points ^: 

$(r) ~ 7^1og(^(r)), r — >■ c«. 

Theorem 15. Let 9 + a > 1 and a G (0, 1). Then the number of clusters K grows asymptotically as a power of the 
expected number of data points ^; 

K{r) ~ „ , „ 5 (CW)", r ^ 00. 



a T{9 + a)\ 9 

Theorem 16. Let 9 + a> 1 and a e (0, 1). Then the expected number of clusters of size j, $j, grows asymptotically 
as a power of the expected number of data points ^: 

> ^ T{1 -a)T{9 + a) rU + l){ 9 J 

Appendix C 

Proofs for Appendix B 

Proof of Lemma 7: hi this case, we have 



E[J2ik\h: 



by the tower property 



= E 



^E[ifc|b.: 



by monotonicity 



= E 



E 



(1 - bk) 



using the mean of the negative binomial distribution 

br 
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by Campbell's Theorem (Kingman, 1993) 



r(l - a)r(^ + a) Jo 



The final line is finite iff 



1 - a > 0, and 9 + a - 1 > 0. 



Equivalently, the final line is finite iff 



a < 1 and 6 > I — a. 



□ 



Proof of Lemma 8: Let B = J2k^k'>Pk be beta process distributed. Let ik ~ NB(r, By the Marking 
theorem (Kingman, 1993), the Poisson process {bk,4>k,ik} has intensity 



u{db, d^, i) = jeb-\l - bf-^ T ^ ^ ^1(1- bYb' db Hord{di^) 



So the Poisson process {ik} has intensity 



1/(71) = j9 



T{i + r) Tii)T{r + 9) 
r(z + l)r(r) T{i + r + 9) 



(12) 
(13) 



So, by Campbell's theorem (Kingman, 1993), 

oo 



i=l 



= j9- 



T{r + 9)^ T{i + r) 



E 



tln^ + r + 9) 



To evaluate the sum X^^i r(i+r+e) ' appeal to a result from Tricomi and Erdelyi (1951): 



+ a) ^ a-b 
T{x + b) 



In particular. 



and 



Tji + r) 
r{i + r + 0) 

Tji + r) 
r{i + r + 9) 



< (i + r)- 



> (i + r)- 



-1) 



2{i + r) 

0(0 - 1) 
2{i + r) 



+ C{i + ry 



C'{i + r)- 



X oo 



for some constant C 



for some constant C 



(14) 
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Before proceeding, we establish for a> 1, 

/.oo 

V(i + r)-" < / {x + r)-" dx 
i=i -^^=0 

= (a - 1)-Vi-" 



and 



So 



and 



/-oo 

= (a-l)-i(r + l)i-" 



r 



Since, for 9 > 1, we have 

(r + iy-e ^^^^ 

it follows that 

y 5' + ""^,, ^{e- i)-'r'-' (16) 

1=1 ^ ' 

From Eq. 14, we also have '"[T^^^^^ ~ as r -> oo. So we conclude that 

^Y] ik] ~ Tx^*^, r ^ 00, 
as desired. □ 



Proof of Lemma 9: The proof proceeds as above. In this case, we have that the Poisson process {bk,il>k,ik} 
has intensity 

So the Poisson process {ik} has intensity 

r(l + 6>) r(z + r) T{i-a)r{r + 9 + a) 

'^'^'^> -^Y{l-a)T{9 + a)r{i + l)T{r) T{i + r + e) 

By Campbell's theorem, 

oo 

fc i=l 

T{l + e) r{r + e + a)^ r{i + r) T{i-a) 

~^r{i-a)r{e + a) f(r) ^r(i + r + 6') r(i) ■ 



We will find the following inequalities, with i > 1 and a e (0, 1), useful: 

(^-a)-"<^^^<(^-l)-". 
We will also find the following integrals useful. Let a > 1. 

oo 

^(* + r)-«(z-a)-" 

oo 

<^(^+^)-«(i_l)-« 

i=2 

< / {x + ryx-"' dx 

Jx=0 

poo 

{y + lyy-" dy 

Jy=0 

^ <,-^+i r(l-a)r(a + a-l) 
r(a) 



Similarly, 



5^0: + r)-«(i-l)-« 

OO 

/•oo 

> / (x + r)-"a;-" da; 

/•oo /'2 

= / (a; + r)~"a;~" rfa; - / (x + r)~"a;~" rfa; 

r(a) ^ ^ 



First, we consider an upper bound. To that end, 

^r(» + r + 0) r(z) 



for some constant C 

-,-„+ir(l-a)r(0 + a-l) 



< r" 



r(0) 

g(g + i) „-,-„ r(i-Q)r(g + i + a-i) 



2 r(6' + i) 

_e-a-ir(l-a)r(0 + a + l) 



r(0 + 2) 

For the lower boun.d, 

^_T{i + r) r{i - a) 

i=1 
> 



Tii + r + 6) V{i) 



i=2 

for some constant C' 
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— r 



r{e + i) 



_ -g-^-i r(i-a)r(e + a + i) 
r{0 + 2) 

It follows from the two bounds above that 



E 



Tji + r) T{z-a) ^ r(l - a)r(g + a - 1) 



Since 



it follows that 



^r(z + r + 0) r(z) m 



r(r) 



^[Z,*/=J T'r(i-a)r(^ + a) r(e) 



'6» + Q!- 1 

as was to be shown. □ 



Proof of Lemma 10: Given an atom hk of the beta process, the probability that the associated negative binomial 
coimt ik is non-zero is 1 — (1 — buY. It follows that 

l{ik > 0}] = E[E[^ l{ik > 0}\bk]] 

k k 

= E['£l-{l-bky] 

k 

= j {!-{!- by )vMdh) 

where vsp is the intensity of beta process atoms {bk}- For integer r, this integral was calculated by Broderick 
et al. (2012) to be - ie\og{r). 

Note that, in applying (Broderick et al., 2012), we are using the form of the negative binomial distribution to 
reinterpret the desired expectation as the expected number of features represented in a beta-Bernoulli process 
with r draws from the same underlying base measure. 

Now consider general r > 1. Let r(°) = \r\ and r*^^) = [r]. Then 

j,{i-{i-br'"'yMdb) ^ /,(i-(i-6)>BP(rfb) ^ ^i-{i-bY''')vMdb) 

76ilog(r(i)) - 76»log(r) " ^e\og(ri^)) ^ ' 

by monotorvicity. Moreover, 

/,(l-(l-6)'-'°^)^.BP(rfb) 

76'log(r(i)) 

_ /,(l-(l-b)'"'VBp(rf6) ^e\og{r^^)) 
76'log(r(o)) ' 76'log(r(i)) 
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1, r — >■ cx). 



Similarly, 



So 



/,(l-(l-6)>Bp(ci6) 



1, r — >■ 00, 



7^ log(r) 

as was to be shown. □ 



Proof of Lemma 11: By the discussion in the previous proposition, this result follows from the results 
in Broderick et al. (2012). □ 



Proof of Lemma 12: Given an atom bk of the beta process, the probability that the associated negative binomial 
count ik is equal to j is NB(j|r, bk)- It follows that 

k k 

= E[^NB(j|r,6fe)] 

k 

= v{j) as above 

r(j + r) r(j)r(r + e) 



= 76» 



Now we use 



and 



to obtain 



r(i + l)r(r) T{j + T + 6) 

r(r + 0) , 

r(r) 



r(j + r + 0) 



□ 



Proof of Lemma 13: As in the BNBP case, we have 

EEife=i}] = Ki) 

k 

r(l + e) r{j + r) r(j - a)T{r + e + a) 



= 7 



r(l - a)r(^ + a) T{j + l)r(r) r(j + r + e) 
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Now we use 



r{r + 9 + a) 
r(r) 



and 

r(j + r) 



r 



r(j + r - 

to obtain 

r(i + e) T{j - a) 



my iHk = ill ~ 7 ^^-^-"^ 



r 



□ 



Proof of Theorem 14: Assume 6* > 1. We have from the previous discussion that 

lim 41 = 1. 



So 

lim log(^(r)) — log(r) = — log ( 7 



Hence 



since log(r) — )• 00 as r — >■ 00. 
From Lemma 10, we also have 



Finally, then. 



9-1 



Hmi^^ = l 



lim -JW = 1. 



■ 7^ log(r) 



lim - 1 

r-)-oo 7^1og(^(r)) 



□ 



Proof of Theorem 15: From above, we have 



So 



From Lemma 11, we also have 



lim 

r— >oo 'y 



e+a-1' 



lim 

r— >-oo 



1- K{r) a.s. , 

lim r<a^^\ = ^ 

a r(6l+a) ' 
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So 

VSV a T(e+a) a.s. , 
hm -a = 1, 

□ 



Proof of Theorem 16: As above, we have from Lemma 13 that 

lim - 



So 



r— )-oo 



. Tji+e) r(j-a) 
'r(i-a)r(e+a) ro+i) ' 



r(i+fl) rg-g) 

-a)r(e+a) r( 

(75+^T)"$,(r) 

yielding the desired result. □ 



Appendix D 
conjugacy 

D.1 Beta process and Bernoulli process 

The following theorem describes in full detail the conjugacy of the beta process prior with respect to the 
Bernoulli process likelihood as suirvmarized in Theorem 1 and as estabUshed for feature models by Thibaux 
and Jordan (2007). 

Theorem 17. Let H be a measure with atomic component Hjix — J2f=iPi^^i continuous component Hard- Let 
9 and 7 be strictly positive scalars. Consider N conditionally-independent draws from the Bernoulli process: In = 

^f=iifix,n,i5ui + iord.n,j6y. '-^ BcP(_B), foT u ^ 1, . . . , N wUk B ^ BP{9, 7, H). That is, the Bernoulli process 
draws have J atoms that are not located at the atoms of Hjix. Then, B\Ii, . . . ,1m ~ BP {Opost, "/post, Hpost) with 
Opost = + N, 'jpost = Tgipjv' fl"'^ Hpost,ord = Hard- Further, Hpostjix = J2f=i PposU^m + Z]j=i ^postjSy^, where 

Ppost,l = Pl + (^'7)"^ J2n=l ifix,n,l md ^postj = {S'f)'^ J2n=l iord,n,j- 

D.2 Reparameterlzed beta process and Bernoulli process 

The following corollary describes in full detail the conjugacy of the reparameterlzed beta process prior with 
respect to the Bernoulli process likelihood as summarized in Corollary 2. 

Corollary 18. Assume the conditions of Theorem 1, and consider N conditionally-independent Bernoulli process draws: 
In = Ef=i + T,j=iiord,n,jSy. ~ BeP{B), for 71 = l,...,N with B ~ RBP{9,"f,u, p,a, Hard) and 

{Pi}iLi and {o-J^i strictly positive scalars. Then, B\h, ...,1^^ RBP (^post, 7post, Upost, Pposu <^post, Hpost,ord), for 
Opost = 0-\-N, -jpost = le^, Hpost,ord = Hord, and L + J fixed atoms, {upost,i'} = {ui}iLi U {t^j}/=i. The Pp„^^ 
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and cTpost parameters satisfy Ppost,i = Pi + J2n=i ifix,n,i and <jpost,i = cri + N - J2n=i ifix,n,i for I e {1, ... , L) and 

Ppost,L+j = J2n=l '>'ord,n,j md Crpost,L+j = + N - ^^=1 ^ord^nj for j € {1, . . . , J}. 

D.3 Reparameterized beta process and negative binomial process 

The following theorem is a corollary of Theorems 20 and 21 below and provides full detail for Theorem 3 in 
the main text. In particular. Theorems 20 and 21, give us the form of the posterior process when we have a 

general CRM prior with a Poisson process intensity with finite mean. Choosing the particular Poisson process 
intensity for the RBP and choosing the distributions of the prior fixed weights yields the result. 

Theorem 19. Let and 7 be strictly positive scalars. Let {ui,..., ul) G Let the members of and {cr;}^^ 

be strictly positive scalars. Let Hard be continuous measure on ^. Consider the following model for N draws from 
a negative binomial process: In = J2t=i'''fix,n,iSui + '''ord,n,jSy^ ~ NBP(S), /or n = 1,...,N with B ~ 
RBP(0, 7, u, p, cr, Hord)- That is, the negative binomial process draws have J atoms that are not located at the atoms 

of H fix- Then, B\Ii, . . . , In ^ ^BP{dpost,"fpost,Upost, pposn ^posi, IIpost,ord) for 6.post = + Nr, 'jpost = 1 e+Nr ' 
IIpost,ord = Hord, and L + J fixed atoms, {upostA = U The Pp^^^ and aposi parameters satisfy 

Ppostd = Pi+ Z)n=i ^fix,n,i cind apost,i = cTi + vN for I £ {1, . . . , L} and Ppost,L+j = Yln=i iord^nj and apost,L+j = 
+ rN for j e {1,..., J}. 

D.4 Finite Poisson process intensity 

Theorem 20. Let Bprior be a discrete, completely random measure on [0, 1] with atom locations in [0, 1]. Suppose it 
has the following components. 

• The ordinary component is generated from a Poisson point process with intensity v{dh) dip such that v is continuous 
and i/[0, 1] < 00. In particular, the weights are in the b axis, and the atom locations are in the tp axis. 

• There are L fixed atoms at locations ui,. . . ,ul e [0,1]. The weight of the Ith fixed atom is a random variable with 
distribution hi. 

• There is no deterministic measure component. 

Draw a negative binomial process I with shape parameter r and input measure Bprior- Let K be the number of 
(nonzero) atoms of I. Let IT = {{ik, Sk)}^^i be the pairs of observed nonzero counts and corresponding atom locations. 

Then the posterior process for the input measure to the negative binomial process given I is a completely random 
measure Bpost 'with the following components. 

• The ordinary component is generated from a Poisson point process with intensity 

(1 - bYiy{db) dip. 

• There are three sets of fixed atoms. 

1) There are the old, repeated fixed atoms. Ifui = Sk for some k, there is a fixed atom at ui with weight density 

c-r^{l-byb'>'hi{db) dip, 
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where the Cor is the normalizing constant: 

Cor= [ I {l-byb'-hi{db). 

2) There are the old, unrepealed fixed atoms. Ifui ^ {si, . . . , sk}, there is a fixed atom at ui with weight density 

c-^{i-bYhiidb), 

where the Cor is the normalizing constant: 

cou= f ii-byhi{db). 

Jb=0 

3) There are the new fixed atoms. If Sk ^ {ui, . . .,ul}, there is a fixed atom at Sk with weight density 

c-Ui-byu-uidb), 

where the Cnew is the normalizing constant: 

Cnew= I {l-byV''v{db). 

• There is no deterministic measure component. 

Proof. Let (971, Eot) be the set of completely random measures on [0, 1] with weights in [0, 1] and its 
associated sigma algebra. Let (©, Eg) be the set of completely random measures on [0, 1] with atom weights 
in {1, 2, . . .} and its associated sigma algebra. For any sets M G Sgjt and G G Eg, let Pprior(Af x G) be the 
probability distribution induced on such sets by the prior construction of the prior measure Bprior and the 
negative binomial process /. Let Q{Ad : G) be the probability distribution induced on measures in by the 
proposed posterior distribution. Finally, let Fmarg{G) be the prior marginal distribution on counting measures 
in (5. To prove the theorem, it is enough to show that, for any such sets M and G, we have 

¥p„or(MxG)= [ Q{M -.IjF^argil) (22) 

JieG 

The remainder of the proof will proceed as follows. We start by introducing some further notation. Then 
we will note that it is enough to prove Eq. 22 for certain, restricted forms of the sets M and G. Next, we will 
in turn find the form of each of (1) the prior distribution Fprior, (2) the proposed posterior distribution Q, and 
(3) the marginal count process distribution Pmorg for our special sets of interest. Finally, we will show that 
we can integrate out the posterior with respect to the marginal in order to obtain the prior, as in Eq. 22. 

Start by noting that we can write Bprior as 

J L 

Bprior (dtp) = X] ^j^^i C*^^) + X] ('^^)- 

3=1 1=1 

Here, J is the number of atoms in the ordinary component of Bprior- So the total number of atoms in Bprior is 
J + L, and the total number of atoms in the coimting measure with parameter Bprior is K < J + L. The atom 
locations of the ordinary component are {vj} as the fixed atom locations are at {ui}. Together, we have that the 
full set of atoms of the counting measure is some subset of the disjoint imion {sk}k=i ^ {vjjj^i U The 
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atom weights at the fixed {ui} locations are {r/i}, and the atom weights at the ordinary component locations 
{vj} are {^j}. 

Let A = i/[0, 1], which we know to be finite by assumption. Then the number of atoms in the ordinary 
component is Poisson-distributed: 

J ~ Pois(A). 

And the are independent and identically distributed random variables with values in [0, 1] such that 

each has density v{dh)/X. 

Next, we note that instead of general sets M and G, we can restrict to sets of the form 

J L 

M' = {j = j}n f]{vj < vj,^j < n f]{vi < m}. (24) 

i=i 1=1 
G' = {K = l}n{h=h,si=h} (25) 

That is, in the random measure Bprior case, we consider a set with a fixed number j of ordinary component 
atoms and with fixed upper boimds ^j, or ffi on, respectively, the location of the ordinary component atoms, 
the weights of the ordinary component atoms, and the weights of the fixed atoms. In the coimting measure 
/ case, we can restrict to a single atom with location at si and count equal to ii € {1,2....}. 

With this notation and restriction in hand, we proceed to compute the prior, marginal, and posterior so that 
we may check whether Eq. 22 holds. 

D.4.0.1 Prior.: We first calculate the prior measure of set M'. Recall that the number of atoms is Poisson- 
distributed: 



J! 



^ priori, J — — ^1 ^ ' (^^) 



Also, the locations of these atoms, given their number, is distributed as 

Ppr.or( n {Vj<Vj]\J=J) = J\ / • • • / n '^^^^ • (2^) 

j=i Jti>i=o Jii>2=ilii Ji'j=il'j-i \ j=i / 

Note that V(^j) denotes the jth order statistic of and the J! term results from enumerating the possible 

rearrangements of this set. Finally, the sizes of the atoms, given their location and number, have the distribution 

J L J 



L prior 



3 = 1 1=1 3=1 



3=1 



6=0 



n / hi^db) 

Jb=0 



(28) 



(29) 



Together, Eqs. 26, 27, and 28 yield the prior probability of the set M' (Eq. 24) describing the random measure 
Next, we tiirn to the prior probability of the set G' describing the counting measure /. hx this case, we 
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condition on a particular measure e M' . Now, in G' , each counting measure I has exactly one atom. This 
atom can occur either at an atom in the ordinary component of n, located at one of {vjYj^-^, or at a fixed 
atom of /X, located at one of We take advantage of the fact that the ui are unique by assumption and 

that the Vj are a.s. imique and distinct from the ui by the assumption that the distribution on locations is 

continuous. We also note that on the set {si < si), we need only consider those atoms with locations at most 
si. Thus, we break into these two special cases as follows: 



'^pTior{K = l,n = il,Sl < Sl|/U) 



y^Vprior{K =l,ii= ii,Si = Vj\lj)l{Vj < Si} 



L 



y^^FpriorjK = l,ii= ii,Si = Ul\ll)l{ui < Si}. 



1=1 

The probability that the single nonzero count occurs at a particular atom is the probability that a nonzero 
count appears at this atom and zero counts appear at all other atoms. To express this probability, we first 
define a new function: 

$(J,L,v,4,r7,ii,s) = |n [NB(0|r,^,)]^^''^^^> [NB(H|r,^,)]^{''^=^> 

C L 



Kl=l ) 

Here, NB(a;|a, h) is the negative binomial density. A notable special case is NB(0|a, 6) = (1 — 6)". We can write 
the single-atom probabilities with the $ notation: 

Fprior{K = l,ii = ii,Si = Vj\ll) = ^{J,L,V,^,r),ii,Vj) 
Fprior{K = l,ii= ii,Si = UjlfJ.) = ^{J,L,v,^,'n,ii,ui). 

We can combine the likelihood of the counting process / given the random measure Bprior with the prior 
of the random measure Bprior to find the joint prior probability of the set M' x G'. If we use the following 
notation to express the sets over which we will integrate, 

J 

R{v, J) 4 {V> : e [0, 1]^, ^1 < • • • < Vj} n fl : V-j < %} 
r(T = (ii,...,tj), J) 4 [0,ti] X ■•■ X [0,tj] 
then we may write 

FprioriM' X G') 

{G'\B) (IP 

prior 

Jb€M' 

J [ p 

~^ l{Vj < Si} 

veR{v,J),^er{i,J),ver{r),L) 
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Hui < Si} 



(30) 



This equation completes oui prior calculation for now. We will retixrn to it when we evaluate Eq. 22 for sets 
M' and G'. 

D.4.0.2 Proposed posterior.: Next we consider the proposed posterior distribution Q. Just as we calcu- 
lated the probability of M' x G' under the measure induced by our prior generative model, we can analogously 
calculate the quantity Q{M' : I) for some I e G' according to the definition of Q. 

In the theorem statement, we specified a construction of completely random measure to induce the proposed 
posterior. In this case, the completely random measure has an ordinary component and a set of fixed atoms. 
Given the specific set G' we are considering (Eq. 25, the set of locations of the fixed atoms is {ui, . . . , uz,}u{si}, 
where the imion is not necessarily disjoint. So there are two cases we must examine: either the counting process 
atom is at the same location as a fixed atom of the prior random measure (si = ui for some Z e {1, . . . , L}) or 
it is at a different location (si ^ {ui, . .. ,ul)- 

First, we consider the case where the coimting process atom location si is the same as that of a fixed 
atom of the prior random measure, say ui*. As before, the number of atoms in the ordinary component is 
Poisson-distributed with mean equal to the total Poisson point process mass 



So we have (c.f. Eq. 26) 



Xpost= I {i-hyu{dh). 

Jb=0 



q{J = J : K =l,ii= ii, si =ui*) = -^e-^""''. 



(31) 



Also, as for Eq. 27, we can calculate the distribution of the locations of the ordinary component atoms: 

J 

Q{f]{vj <Vj}\J = J : K = l,ii=ii,si=ui,) 



J! 



"(1) /■"(2) 



'(J) 



(32) 



And again, as in Eq. 28, the sizes of the atoms, given their location and number, have the distribution 



J 



: K 

J 

n 

i=i 



NB(0|r,fo)i/(d6) 



6=0 



^post 



iJo[NB(zi|r,6)]i{'='*>[NB(0|r,6)]i{''^'*>/iKrf6) 
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(33) 



fJilUi^B{h\r,b)]Hi=nm{0\r,b)]Hm^}hi{db)_ 
Putting together Eqs. 31, 32, and 33, we can find the proposed measure of the set M' given I G G' for the 
case si = ui* : 

Q{M' : I) 

J = J,f]{Vj< Vj}, 



fl {0 <ij}n f]{m <rii}:K=l,H=iuSi= ui. 



/6fl(v,J),C6r(£,J),Tj6r(r7,L) 



JveR(v, 



^{J,L,w,$,r),ii,ui*) 



(34) 



where 



L ,.1 



Cfi.ed,i' = n / [NB(ii|r,6)]i^'='*>[NB(0|r,6)]i^'^'*>/iz(d6) 

Second, we consider the case si ^ . . . ,ul}. Then si = i^j* for some j* e {!,..., J}. Suppose that vj, 
is the jorderth Smallest element of {tii, . . . ,v,j}. Note that jorder is well-defined since the density of the vj is 
continuous. We proceed as above and start by noting that the number of atoms on either side of the location 
Vj- is Poisson-distributed: 



' = J : K = = ii, si = Vj* 



border)' 



(J j order} 



{j order f ) • 

Further, we have the usual distribution for the atom locations on either side of Vj- : 

Pi {'Vj < Vj}\J = J : K = =ii,si = Vj» 



(35) 



= {jorder - 1)! / / ' ' ' / 



jorde 



n T 



dijjj 



(36) 



'aorder + l-^Oorder) •' i' J-i' J-l \j=jorder + i 



As usual, the third step identifies the conditional weight distribution of the atom weights: 



Q flte < ij} n f]{m < m}\J = 1 Cli^j < ij} 



(37) 
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4//=o NB(n|r,6) 



i{j=j'} 



[NB{0\r,b)]^^'^''K{db) 



S /j,Lo [NB(ii|r, 6)1 ' ' [NB(0|r, &)]'^^'^^'^ iy{db) 
^ £oNB(0|r,6)/j,(d6) 



(38) 



Lti/6=oNB(0|r,6)/i,(d6) 

So — combining Eqs. 35, 36, and 38 — we find that the proposed posterior distribution in the case si = Vj- is 

'■•I) 

J 

J=J,^{Vj< Vj], 
J L 

i=i 



vefl(v,J),Cer(£,J),r/er(77,L) 



$(J,L,v,^,?7,ii,?;j.) 



J 



(39) 



where 



a.<i= (^'^NB(ii|r,6)Krf6)^ ■ |^_Q^'^NB(0|r,6)/iKrf6)^ • 

Putting together the cases si = u;. for some 1* (Eq. 34) and si ^ {ui, . . . , wj,} (Eq. 39), we obtain the full 
proposed posterior distribution: 

'■■I) 

J 

J = J,{^{Vj< Vj), 

J L 

{^{ij<ij}r^[\{m<m}.K = i, h = n, si = h 

j=i ;=i 



ifeii(v,j),€er(^,j),77ei-(n,L) 



$(J,L,v,^,?7,ii,u;.) 



+ 1 {si ^ {uu . . .,ul}} C-i,e-^-- / 

J V 



Gii(v,J),€er-(^,7),r/er(n,i) 



$(J,L,v,4,?7,ii,Uj.) 



(40) 



D.4.0.3 Counting process marginal.: With the prior and proposed posterior in hand, it remains to 
calculate the marginal distribution of the counting process. Then we may integrate out the proposed posterior 
with respect to the counting process marginal in order to obtain the prior (Eq. 22). Since we are focusing on 
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counting process sets G' of the form in Eq. 25, we aim to calculate 

^marg{K = 1,H = H,Si < Si). 

In our calculations above, we also worked with a set of prior measure G M' and therefore worked with 

a set of locations for the ordinary component atoms. In this case, we will need to calculate the probability of 

zero counts in an interval where the number and location of the ordinary component atoms is integrated out. 

Let /'{V'} be the counting process that includes exactly those counts at ordinary component atoms and not 

the counts at fixed atoms; we can see, e.g., that < /{V"} at all ^- Further, similar to Eq. 23, let Bord be 

the random measure composed only of those atoms in the ordinary component of Bprior- 

J 

^ord ~ ^ ^ ^Vj ' 

Then we are interested in the quantity: 

E[i{Vie(Vi,V2),m = o}] 

= E n il~Bord{t}y 

_«e(V''i,'i/'2) 

= H {l-E[l-{l-Bord{t}r]), 

where the last line follows from the independence of Bprior across increments. 

Now define a new process B' = 1 — {1 — BordY- This process has intensity i/ , which can be obtained by a 
change of variables from the Poisson process intensity u of Bord- We will find it notationally useful to refer to 

v' though we do not calculate it here. Also, let B' be the mean process of B': B^dtp) = E [B'{d^l))]. With this 
notation in hand, we can write 

E[i{Vie (V'i,V2),s'{t} = 0}] 

= TT (l-S'W) =exp|- =exp|- f hv'{db)\ 

= exp (-(^2 - Vi) / (1 - (1 - by) v{db) 

I Jb=0 

As usual, we consider two separate cases. First, suppose si = ui* for some 1* G {!,..., L}. Then, using 
independence of increments of the prior random measure and counting process, we find 

= "^marailM = H)Pmar9(V/ 7^ l*J{ui} = 0) 
^ marg 

(VtG(0,l),7'{t} = 0) 
• exp - 0) - (1 - by) u{db)^ 
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Next, suppose si ^ {ui, . . . , ul}. Then 

Pmars(-f = !,«!= 5l,Sl < Si) 



(41) 



,(iir = l,ii = iijsi = s 



< V') 



] 

'4>=o 



7 



marglVt G (0,1)\{V'},/'W = 0)C 

NB{ii\r,b)u{db) 

^=0 \Jb=n 

L „i 

NB(0|r, 6)/i(((i6) 

6=0 



< V') 



n 



(42) 



D.4.0.4 Checking integration.: The final step is to note that we may integrate out the proposed posterior 
in Eq. 40 with respect to the marginal described by Eqs. 41 and 42 to obtain the joint prior in Eq. 30. This 

integration is exactly the one we desired from Eq. 22 in the special case of sets of the form M' in Eq. 24 and 
G' in Eq. 25, as was to be shown. □ 

D.5 Infinite Poisson process intensity 

Theorem 21. Theorem 20 still applies when the intensity measure v does not necessarily have a finite integral j/fO, 1] 
but satisfies the (weaker) condition 

' b y{db) < oo. (43) 



Proof: The main idea behind the proof of Theorem 21 is to take advantage of the finiteness condition in 
Eq. 44 to construct a sequence of finite intensity measures tending to the true intensity measure of the process. 
We will use the known form of the posterior in the finite case from Theorem 20 to deduce the form of the 
posterior in the case where v merely satisfies the weaker condition in Eq. 43, which we note implies 

Ve > O,z^[e,oo) < 00. (44) 

We therefore start by defining the sequence of (finite) measures by 

Un{A) = I l{b> l/n}iy{db), for all measurable A c [0, 1] (45) 

JbeA 

Further, we may generate a random measure Bprior,n as described by the prior in Theorem 20 with Poisson 
point process intensity And we may generate a counting process 7„ with parameters r and Bprior,n as 
described in Theorem 20. 

As before, let Pprior be the prior distribution on the prior random measure Bprior and the counting process 
/. Let Eprior denote the expectation with respect to this distribution. Further, let fmarg represent the marginal 
distribution on the coimting process from f prior- And let Q(M : G) represent the proposed posterior distri- 
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button on sets M e 9Jl given any set G G S©. We use the same notatton, but with n subscripts, to denote the 
case with finite intensity 

Our proof will take advantage of Laplacian-style characterizations of distributions. In particular, we note that 
in order to prove Theorem 21, it is enough to show that, for arbitrary continuous and nonnegative functions 
/ and g (i.e., f,g^ C+[0, 1]), we have 



/ / exp\- f' {gimW + fmW)] dQ{B : I) dFmarg{I) 

expj- /' {g{i;)B{i;} + f{i;)I{i;}) 

I Jtb=0 



= Epwo. exp <( - / {gimW + fWIW) } ■ (46) 

By Lemma 22, we have the following limit for all f,g G C+ [0, 1] as n ^ oo and hence the finite intensity Un 
approaches the potentially infinite v. 



^prior,n 



^prior 



exp(- /' (5(V')i3{V} + /(V'W})) 
exp(-/ (5(V)BW + /(s)/{V'}) 



Therefore, by Eq. 46 and the observation that Theorem 20 holds under the finite intensity z/„, we see that it is 
enough to show that 

/ / exp (- /' (5(V)S{V'} + fmm)] dQn{B : /) dFmarg,n{I) 

^ I j exp (- (5(V')S{V} + fmW)] dQ{B : I) dFmarg{I), n ^ oo (47) 
JBemiJie& I Ji/,=o J 



Define 



^n{I)= j expj-/' 5(^)S{V}|rfQ„(B:/) (48) 
\E'(/)^/ expj-/' g{i;)B{i;}] dq{B : I) (49) 

By Lemma 23, we have 

/ exp |- / (*n(/) - *(/))rfP™arg,n(/) ^ 0. (50) 

And Lemma 22 together with the fact that exp |— J^^^ f{tp)I{'tp}^ ^'(7) is a bounded function of I yields 

/ expj-/ f{'llj)I{'llj}]^{I){dFmarg,n{I)-dPmarg{I))^0. (51) 

Combining Eqs. 50 and 51 yields the desired limit in Eq. 47. 

□ 

Lemma 22. Let Bprior.n i completely random measure with a finite set affixed atoms in [0, 1] and with the Poisson 
process intensity Vn in Eq. 45, where v satisfies Eq. 43. Let In be drawn as a negative binomial process with parameters 
Bprior,n md Bprior,n- Similarly, let Bprior be a completely random measure with Poisson process intensity v, and let 
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/ he drawn as a negative binomial process with parameters r and Bprior- Then 
Proof: It is enough to show that, for all f,g G C+[0, 1], we have 



E, 



'prior, n 



exp 



cxp 



■!/. = 



n 0. 



We can construct a new completely random measure, i?„ by keeping only those jumps from Bprior (generated 
with intensity v) that are either at the fixed atom locations or have height at least 1/n. Then I3„ ^ B for B 
generated with intensity Un. Let J„ be the counting process generated with parameters r and B„. Then it is 
enough to show 



exp 



U! = 



'prior 



exp 



n 0. 



Let B^ = Bprior — Bn be the completely random measure consisting only of an ordinary component 
with jumps of size less than 1/n. Let /,7 be a counting process with parameters r and B~. Then, using 
the independence of Bn arid B~, we have 



E„ 



exp 



i/'=0 



:E, 



'prior 



exp 



(3(V)S{V} + /(V')/{V'}) 



V'=o 



•E„ 



exp 



=0 



So it is enough to show that 



E„ 



exp 



4>=o 



0, n — >■ 00. 



(52) 



In order to show Eq. 52 holds, we establish the following upper bound: 

(5(V)s-W + /(«)/- W) 



E, 



'prior 



<E„ 



cxp 



exp |(maxff(V))B-[0, 1] + (max/(V) / /"{V} 



< max < E. 



exp<^2(max5(V'))B-[0,l] 



E, 



'prior 



exp < 2(max£((^)) 



4"W 



(53) 



We will prove the result by showing that both arguments to the outer maximum function approach zero as 
n — >■ oo. First, consider the random measure B~. Since -B~[0, 1] is distributed as a Poisson random variable 



52 



with mean 

b v{db), 



b=0 



we have from the moment-generating function of a Poisson random variable that 

Cl/r. 



exp |2(max5(^/>))5^[0, 1]| = exp J ^e^i^^M^am - l) ^ h v{dh) 



Since max^ g{i)) < 00 and jlL^ b u{db) as n — >■ by Eq. 43, the righthand side has limit one as n — >^ 0. 
Next, we note that /~ is also Poisson-distributed, with parameter 



^ J cb-\i - bf-'I^^J^~ jii - byb^ db 

where C is a constant in n (c.f. Eq. 12) 

= C ^(2/n)™ j^J\br-\l - 2eby+'-' ~ db 

< 2^+^-\l/n) V C / 1""-^! - by+^-^ ( 



'm + r — 1 
m 



= 2^+^-1 (l/n)C", 

where C" is a constant in n (by Lemma 8). As in the B~ case, we see that the Poisson mean parameter goes 

to zero as n — > oo, and the reasoning of the -B^ case leads us to further conclude that 



expi 2(maxs'(V')) / 



0, n — >■ 00. 



Thus, by Eq. 53, we conclude that the desired limit in Eq. 52 holds. □ 
Lemma 23. For and $ defined in, respectively, Eqs. 48 and 49, we have the limit in Eq. 50: 

[ exp |- / /(^)/{^} j (*„(/) - *(/))dP™ar3,„(/) ^ 0. (54) 

Proof: We start by choosing n large enough so that (1) the difference between the ordinary components in 
the truncated case and the non-truncated case are, in some sense, small enough and (2) the number of atoms 
in the trimcated case is boimded with high probability. Under these two conditions, we will then show that 
^n{I) and ^{I) are sufficiently close in value by examining in turn each of the various types of atoms in the 
proposed posterior. 

Therefore, choose e > 0. First note that by the assumption of finite integration of u (Eq. 43) we can choose 
no such that for all n > uq we have 

pl/n 

/ biy{db) < e. (55) 
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This choice implies the existence of ni such that for all n > ni and alH > 1 we have Eq. 55 as well as 



l/n 



6^(1 - byv{db) < e. 



(56) 



5=0 



Second, since / ^marg,n approaches / ~ ^marg in distribution by Lemma 22, there exist constants K' and 
n2 such that the number of atoms Kn of /„ satisfies 

^marg{Kn > K') < e for all n > 712- 

It remains to use these conditions to bound 



/ exp(- / fi^Pm}] 



He® K JjIi=o 

For instance, since ^nil) and ^'(7) are both bounded between zero and one, we have that 



fWIW \ (*„(/) - *(/))dP„ar3,n(/) 



< 



2€+ f - *(7)| dFmargAI)- 



(57) 



K„<K 



Next, we need to bound the second term on the righthand side of Eq. 57. To that end, we break 'i'n and 
* into their three constituent parts: the fixed atoms from the prior, the new fixed atoms in the proposed 
posterior, and the ordinary component in the proposed posterior. For we have 



*„(/) = / exp (- ff(V)S{V}| Q{B : /) 



exp <; - Yl gWBW 

i/':/{'i/'}>l,V'^{«l, •••,"!-} 

-f2g{ui)B{ui}- [ g{7P)Bordm\Q{B:I) 
1=1 -'^=0 J 



Bern. 

L 



H I exp{-5(V)B{V}} 

W [ e^p{-g{ui)B{ui}} 

exp <^ - / gWBordW 



/' exp{-5(V)6}6^W>(l - bYMdb) 

Jb=0 



n 

4!:I{'4>}>1,iP({ui,...,Ul} 

n / e^p{-g{ui)B{ui}} 

exp(/' /' (l-e^'W'')(l-6)V„(d6) 

The analogous formula holds for ^ by removing the n subscripts. 

With the formulas for and ^ in hand, we turn again to our desired bound. We apply Lemma 3 from 
Kim (1999) to transform the difference in ^„ and ^ into separate differences in each component, where we 
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note that the prior fixed atom component is shared and therefore disappears. 

(I) 



< 



ie<s 

Kr,<K' 



E 

iI):I{iP}>1,iIj^{ui,...,ul} 



b=0 



- C„ 



exp 



b=0 
1 /.I 



exp{-ff(^)6}6^W>(l-6)V„(d6) 
exp{-g{^)b}b'^'''^l - hfuidh) 
(l - e^^^^'') (1 - hfunidh) 



/b=0 Jip=0 

-expj^ J (l-e9('^)'') (l-6)V(d6)| 



< Ce, 

for a constant C with no dependence on n or e. 
Together with Eq. 57, this bound completes the proof. 



□ 



Appendix E 

Posterior Inference Details 

E.1 Exact Gibbs slice sampler 

We sample b^^k and tpk from their Gibbs conditionals as follows: 

E.1.0.5 Sample ipk- The conditional posterior of tpk given z. . and x. . is proportional to 

D N,t 
d=l n=l 

This has a closed form when H is conjugate to F{ijjk) and may otherwise be sampled using a generic univariate 
sampling procedure (e.g., random-walk Metropolis-Hastings or slice sampling). 

E.1.0.6 Sample b^x- By beta-negative binomial conjugacy, the conditional posterior of b^^k given z^,. and 
bo^k is a beta distribution, 

bd,k ~ Beta(7d^d?>o,fc + N^^k, - 'ydbo,k) + rd), 

where Nd,k = Y.nKzd,n = k). 

E.1.0.7 Sample 6o,fc: To sample the shared beta process weights 6o,fe/ we turn to the size-biased construc- 
tion of the beta process introduced by Thibaux and Jordan (2007) 



where 



Cm '~ Pois 



\6o+m 



m=0 i=\ 



bo,m,i Beta(l, + m), and ipm,i,- H. 
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If we order the atoms by the rounds in which they were drawn, then the fc-th atom overall was drawn in 
round ruk, where 

mfc = min < m : > C, > fc > . 

\ ^0 - 1 

Conditional on the round indices (mfe)^^, we have 

oo 

fc=l 

for 

bo,k '~ Beta(l, + Wfe) and Vfe ~ H. 
The conditional density of 6o,fe given the remaining variables is therefore proportional to 

and may be sampled using random-walk Metropolis-Hastings. 

It remains then to sample the latent round indices irik or, equivalently, their differences hk = mu — ruk-i, 
where mo = for notational convenience. Let and denote the pmf and cdf of the Pois(^^) distribution 
respectively, and define Cm,j = J2k=i ^('^fe = ™)- Since Cm = J2k=i K''^k = w) ~ Pois(^^), it follows that 

¥{hk < I {hj)''zl) = 0, 



^3 

¥{hk = o\{hj)';zl) 



k-l\ _ 1 Jmfc_i (C'mfc-i,fc-l) 

1 - Fmk_j_{Cmk-i,k-l - 1) 



for ruk-i = hj, and 



F{hk = h I ihj)';zl) = 

{, '/^ ^ 7t(1 - /nife_i+h(0)) TT /mfc_i+s(0) 

i---t'mk-i[Cmk-i,k-l-i-) 

for all h gN. The conditional distribution of hk given {hj)j~l and bo,k is then 

P{hk I {hj)^jZl,bo,k) (X (1 - bo,kf'' {Bo + hk + mk-i)p{hk \ {hj)';zl), 

which cannot be normalized in closed form due to the infinite summation. To permit posterior sampling of 
hk, we introduce an auxiliary variable Vk with conditional distribution 

Vk^Vmf{0,^o,hd^-bo,kf'') 

where (Co,h)^i is a fixed positive sequence with lim/j^oo io,h = 0. Given Vk, we may slice sample hk from the 
finite distribution 

p{hk I {hj)';zlbo,k) oc < ioMl - bo,kr^) ^^^ ^^^^ mk-i)p{hk I {hj)']zl). 
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E.2 Collapsed sampling 

In Eq. 58, we sampled 60, fc conditional on b. fe. A more efficient alternative is to integrate b.^^ out of this 
conditional. We exploit the conjugacy of the beta and negative binomial distributions to derive the conditional 
distribution of Nd,k given bo,k, Id, Od, and r^: 

p{Nd,k I bd,k,rd)p{bd,k I bo,k,ld,Od)dbd,k 



/' 

/ Nd,k\ r(rd) T{^d6dbo,k) r(^d(l - 7d6o,fe)) 

r(A^d,fc + Td) TiOd) T{Nd,k + 'rdOdbo,k) T{rd + - 7dbo,k)) 



-dbd,k 



Nd,k\ r(rd) V{Nd,k + rd + 6d) T{^dOdbo,k) r(^d(l - jdbo,k)) ' 
The conditional density of bo,k with b.,fe integrated out now takes the form 

xeo+mfc-i Y] r(^d,fc + ldOdbo,k) r(rrf + Odjl - 7d&o,fc)) 

°' H r{'ydedbo,k)r{ed{i-7dbo,k)) 

and may be sampled using random-walk Metropolis-Hastings. 
E.3 Finite approximation Gibbs sampler 

The full conditional distribution of 6o,fc under the finite approximation of Eq. 11 is proportional to 

D 1 / ^ \ ldSdbo,k 

'alo/ K-1 , 



,eo7o/if-i/i _ , -,eo(i-7o/if)-i TT t / "-^.^ 

^ T{^dedbo,k)nOd{l - ldbo,k)) V 1 - bd,k 



while the conditional density with b.^^ integrated out is proportional to 

feo7o/K-lf. _ , .eo{l-yo/K)-l TT ^{Nd,k + ldOdbo,k) ^{vd + - ldbo,k)) 

^ r(7<ie,6o,fe)r(^rf(l-7<i6o,fe)) 
Random- walk Metropolis-Hastings may be used to sample 60, fc from either distribution. 

With this approximation in hand, we sample Xd.k, bd,k, and ijjk precisely as described in Section 8.1. Since 
the number of components is finite, no auxiliary slice variables are needed to sample the component indices. 
Hence, we may sample Zd^n from its discrete conditional distribution 

^{Zd,n = k) OC F{dXd,n \ '^k)>^d,k 

given the remaining variables. 



